# Suspicious Apps Detection - Machine Learning and Token Analysis

<br />Mengchuan (Mike) Fu mfu10@fordham.edu February 2017 (updated 3/09/2017)<br /><br />
For the Mobile Safety Research<br /><br />

## Installing scikit-learn

Option 1: Install scikit-learn library and dependencies (NumPy and SciPy)<br />
Option 2: Install Anaconda distribution of Python, which includes:
- Hundreds of useful packages (including scikit-learn)
- IPython and IPython Notebook
- conda package manager
- Spyder IDE

## Objective

The main objective is to monitor age ratings on the iOS App Store and to flag suspicious apps that are likely to be mis-rated.<br /><br />
What this notebook does:<br />
1. Import the data, examine the shape and distribution
2. Randomly split data into training and testing sets
3. Data vectorization: include only 1-grams and 2-grams; ignore terms that appear in more than 50% of the documents; only keep terms that appear in at least 2 documents
4. Generate document-term matrix
5. Build and evaluate models (Naive Bayes and Logistic Regression)
6. Examine tokens (ratio-based)

## Dataset Overview 

44,840 app records with title, description and maturity rating<br />
Crawled from Apple App Store

## Exploring the Dataset

In [1]:
import nltk
import pandas as pd

# read files
df = pd.read_csv("ml-dataset.csv",header = None, names=['label', 'description'])
df.head()

Unnamed: 0,label,description
0,12,Download the best Slot experience for free to...
1,12,Download the best Slot experience for free to...
2,12,Download the best Slot experience for free to...
3,12,Download the best Slot experience for free to...
4,12,Download the best Slot experience for free to...


In [2]:
# examine the shape
df.shape

(44840, 2)

In [3]:
# examine the class distribution
df.label.value_counts()

4     31486
12     8026
9      3479
17     1849
Name: label, dtype: int64
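
The counts above show a heavily imbalanced dataset, which matters when reading the accuracy figures later on. As a rough reference point, the sketch below computes each class's share and the accuracy a trivial "always predict 4+" classifier would reach on the full dataset. The variable names are illustrative and not part of the original notebook.

```python
# Class counts copied from the value_counts() output above.
counts = {4: 31486, 12: 8026, 9: 3479, 17: 1849}

total = sum(counts.values())
# Share of each age rating in the dataset.
shares = {label: n / total for label, n in counts.items()}
# Accuracy of a classifier that always predicts the most common rating.
majority_baseline = max(shares.values())

for label in sorted(shares):
    print(f"{label:>2}+: {shares[label]:.3f}")
print(f"majority-class baseline: {majority_baseline:.3f}")
```

Any model accuracy should be judged against this baseline rather than against zero.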

In [4]:
# convert label to a consecutive numerical variable
df['label_num'] = df.label.map({4:0,9:1,12:2,17:3})

In [5]:
# check that the conversion worked
df.head(10)

Unnamed: 0,label,description,label_num
0,12,Download the best Slot experience for free to...,2
1,12,Download the best Slot experience for free to...,2
2,12,Download the best Slot experience for free to...,2
3,12,Download the best Slot experience for free to...,2
4,12,Download the best Slot experience for free to...,2
5,12,Download the best Slot experience for free to...,2
6,12,Download the best Slot experience for free to...,2
7,12,Download the best Slot experience for free to...,2
8,12,Download the best Slot experience for free to...,2
9,12,Download the best Slot experience for free to...,2


In [6]:
X = df.description
y = df.label_num
print(X.shape)
print(y.shape)

(44840,)
(44840,)


In [7]:
# split X and y into training and testing sets
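# sklearn.cross_validation was removed in scikit-learn 0.20; model_selection provides the same function
from sklearn.model_selection import train_test_split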
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(33630,)
(11210,)
(33630,)
(11210,)
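
The split above uses the default 25% test size with plain random sampling. With classes this imbalanced, one option (not used in this notebook) is to pass `stratify=y` so that every rating keeps the same share in both sets. A minimal sketch on hypothetical toy labels:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy labels with an imbalance similar in spirit to the real data.
y = np.array([0] * 72 + [1] * 16 + [2] * 8 + [3] * 4)
X = np.arange(len(y))  # stand-in for the description column

# stratify=y keeps the class proportions (approximately) equal in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1, stratify=y
)

train_counts = np.bincount(y_train, minlength=4)
test_counts = np.bincount(y_test, minlength=4)
print("train:", train_counts)
print("test: ", test_counts)
```

Without stratification, the rarest class could by chance end up under-represented in the test set, which makes its per-class metrics noisier.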




## Vectorizing the dataset

In [8]:
# instantiate the vectorizer and remove stop words 
# include only 1-grams and 2-grams
# ignore terms that appear in more than 50% of the documents
# only keep terms that appear in at least 2 documents
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer(stop_words='english',ngram_range=(1,2),max_df=0.5,min_df=2)
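
To see what `ngram_range`, `max_df` and `min_df` do, it can help to run the same settings on a tiny corpus. The four documents below are made up for illustration, and `stop_words` is left out so that only the three frequency-related parameters affect the result.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini-corpus, not taken from the dataset.
docs = [
    "free slot game",
    "free slot machine",
    "free tank battle",
    "free puzzle game",
]

toy_vect = CountVectorizer(ngram_range=(1, 2), max_df=0.5, min_df=2)
toy_dtm = toy_vect.fit_transform(docs)

# "free" is in 4 of 4 documents (more than 50%), so max_df drops it.
# Terms seen in only one document ("machine", "tank battle", ...) are dropped by min_df.
vocabulary = sorted(toy_vect.vocabulary_)
print(vocabulary)
print(toy_dtm.toarray())
```

Only terms that appear in exactly two of the four documents survive both filters here, including the bigram "free slot".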

In [9]:
# fit and transform
X_train_dtm = vect.fit_transform(X_train)

In [10]:
# examine the document-term matrix
X_train_dtm

<33630x346189 sparse matrix of type '<class 'numpy.int64'>'
	with 3552116 stored elements in Compressed Sparse Row format>
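
The numbers in this output explain why the matrix is stored in sparse format. A quick back-of-the-envelope calculation, using only the figures printed above:

```python
# Figures copied from the sparse matrix summary above.
n_docs = 33630
n_terms = 346189
n_stored = 3552116

# Fraction of cells that hold a non-zero count.
density = n_stored / (n_docs * n_terms)
# Average number of distinct vocabulary terms per description.
terms_per_doc = n_stored / n_docs

print(f"density: {density:.6f}")
print(f"distinct terms per document: {terms_per_doc:.1f}")
```

Roughly 0.03% of the cells are non-zero, so a dense array of the same shape would be almost entirely zeros.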

In [11]:
# transform testing data into a document-term matrix
X_test_dtm = vect.transform(X_test.values.astype('U'))
X_test_dtm

<11210x346189 sparse matrix of type '<class 'numpy.int64'>'
	with 1075118 stored elements in Compressed Sparse Row format>

## Building and evaluating a model (Naive Bayes)

In [12]:
# import and instantiate a Multinomial Naive Bayes model
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()

In [13]:
# train the model using X_train_dtm 
%time nb.fit(X_train_dtm, y_train)

Wall time: 153 ms


MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [14]:
# make class predictions for X_test_dtm
y_pred_class = nb.predict(X_test_dtm)

In [15]:
# calculate accuracy of class predictions
from sklearn import metrics
metrics.accuracy_score(y_test, y_pred_class)

0.84317573595004458

In [16]:
# print the confusion matrix
metrics.confusion_matrix(y_test, y_pred_class)

array([[7540,  222,   53,   40],
       [ 499,  333,   24,   12],
       [ 351,  185, 1426,   62],
       [ 145,   36,  129,  153]])
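
Overall accuracy hides how unevenly the classes are handled. The sketch below derives per-class recall and precision directly from the confusion matrix printed above (rows are true classes, columns are predicted classes, in the order 4+, 9+, 12+, 17+). The variable names are illustrative.

```python
import numpy as np

# Naive Bayes confusion matrix copied from the output above.
cm = np.array([
    [7540,  222,   53,   40],
    [ 499,  333,   24,   12],
    [ 351,  185, 1426,   62],
    [ 145,   36,  129,  153],
])
ratings = ["4+", "9+", "12+", "17+"]

correct = np.diag(cm)
accuracy = correct.sum() / cm.sum()
recall = correct / cm.sum(axis=1)     # share of each true class that was found
precision = correct / cm.sum(axis=0)  # share of each predicted class that was right

print(f"accuracy: {accuracy:.4f}")
for name, r, p in zip(ratings, recall, precision):
    print(f"{name:>3}  recall={r:.3f}  precision={p:.3f}")
```

The same figures can be obtained from the predictions themselves with `metrics.classification_report(y_test, y_pred_class)`.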

In [17]:
# print descriptions of apps rated 4+ that the model predicted as 9+ (predicted rating above the labelled one)
X_test[(y_pred_class==1)&(y_test==0)]

42600     #1 Iron Maiden Trivia!!! Iron Maiden is the #...
10638    FULL VERSION, WITHOUT ADS!It's time to build y...
10904     Battle for Galaxy has begun! Galaxy Empire Wa...
30855     Its the battle of the Presidents to be  Trump...
35632     Tank Hero Classic: The ultimate tank battle.F...
36946     Be a Angry Snow Fox and attack Animals in thi...
12928     Halloween edition of Fortress Defense.Hallows...
35579    Bomb enemy tanks with cool weapons such as ray...
35492    Tank Battle Go! is an action game you into qui...
43986     All markers are available for free on http://...
28626     Black-eyed, chubby-figured, short-limbed...An...
10567     Now its your chance to defend your planet and...
35719     "Tank fire" with tanks for the protagonist, t...
24674     A big spaceship fleet is attacking you."A one...
35986     UFO Space is an addicting 2D space shooter, s...
40210     WARLOCK WOODS is a new, simple to play tower ...
44531     Can you help Milton, the resident town Zombie.

In [18]:
# print descriptions of apps whose predicted rating is lower than the labelled one
X_test[y_pred_class < y_test]

8236     Egg Mania is a simple yet entertaining game de...
39530     The goal - to drop the bricks on the wall cra...
11014     Der Mega-Tippspa in der grten Community!Mit G...
25054     O Patriota: Detonando a EsquerdaJogo de polti...
35560     In Tank Cavalry - Panzer Destroyer, you are a...
39402     Explore an amazing world with the power of yo...
6689      Experience an epic RPG adventure in the palm ...
43380     Get the complete version: https://itunes.appl...
33075     Duel type game. Defeat your enemies in mediev...
13225     Hello girls!! If you are crazy about horror c...
41639     A more challenging version of Tic TAC Toe! Pr...
41239     Use Xbox 360 SmartGlass to enhance your enter...
31956     Be the Almighty Gem Mover in this addictive M...
22428     BECOME THE LEADER OF YOUR OWN MAFIAThe battle...
25718     The strangely moving tale of a small oden car...
44222     Few Steps before you Enjoy a new immersive Au...
6873      Eagle Wings - Flappy style adventure , with  .
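
For the stated objective, the interesting cases are apps whose description looks like a higher age rating than the one they carry. One way to collect them, sketched here on hypothetical arrays rather than the real predictions, is to compare the predicted and labelled class indices and map both back to the original ratings:

```python
import numpy as np
import pandas as pd

# Class index -> age rating, matching the mapping {4:0, 9:1, 12:2, 17:3} used earlier.
rating_of_class = np.array([4, 9, 12, 17])

# Hypothetical labelled classes (indexed by app row id) and model predictions.
y_true = pd.Series([0, 0, 2, 3, 1], index=[101, 102, 103, 104, 105])
y_pred = np.array([1, 0, 2, 1, 3])

# True where the model predicts a higher rating than the labelled one.
over = y_pred > y_true.to_numpy()

flagged = pd.DataFrame(
    {
        "labelled": rating_of_class[y_true.to_numpy()[over]],
        "predicted": rating_of_class[y_pred[over]],
    },
    index=y_true.index[over],
)
print(flagged)
```

With the notebook's own variables, `y_true` would correspond to `y_test` and `y_pred` to `y_pred_class`; the flagged rows are candidates for manual review, not confirmed mis-ratings.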

In [19]:
# calculate predicted probabilities for X_test_dtm
y_pred_prob = nb.predict_proba(X_test_dtm)
y_pred_prob

array([[  1.00000000e+000,   2.88113750e-018,   4.87381507e-024,
          3.54659276e-021],
       [  1.00000000e+000,   7.10281237e-038,   2.52571439e-065,
          3.87059211e-064],
       [  1.69525606e-107,   1.00000000e+000,   4.08408217e-109,
          8.21348608e-140],
       ..., 
       [  0.00000000e+000,   0.00000000e+000,   1.00000000e+000,
          4.55103395e-214],
       [  1.00000000e+000,   1.61227874e-015,   1.81563981e-024,
          5.59025042e-026],
       [  1.00000000e+000,   1.38205699e-056,   1.40769380e-086,
          3.19026087e-043]])

In [20]:
# AUC is not computed here: the binary ROC AUC does not apply directly to this four-class problem
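
Although the binary ROC AUC does not carry over directly to four classes, `roc_auc_score` in scikit-learn 0.22 and later accepts a `multi_class` argument that averages one-vs-rest AUCs. The sketch below shows the call on synthetic data; the dataset and model here are stand-ins, not the notebook's.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic 4-class problem used only to demonstrate the API.
X, y = make_classification(
    n_samples=400, n_features=20, n_informative=6,
    n_classes=4, n_clusters_per_class=1, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=1, stratify=y
)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba = clf.predict_proba(X_test)  # shape (n_samples, 4), rows sum to 1

# Macro-averaged one-vs-rest AUC across the four classes.
auc_ovr = roc_auc_score(y_test, proba, multi_class="ovr", average="macro")
print(f"one-vs-rest AUC: {auc_ovr:.3f}")
```

Applied to this notebook, the analogous call would presumably be `metrics.roc_auc_score(y_test, y_pred_prob, multi_class="ovr")`, assuming a recent enough scikit-learn.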

## Logistic regression

In [21]:
# import and instantiate a logistic regression model
from sklearn.linear_model import LogisticRegression 
logreg = LogisticRegression()

In [22]:
# train the model using X_train_dtm
%time logreg.fit(X_train_dtm, y_train)

Wall time: 54.6 s


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [23]:
# make class predictions for X_test_dtm
y_pred_class = logreg.predict(X_test_dtm)

In [24]:
# calculate predicted probabilities for X_test_dtm
y_pred_prob = logreg.predict_proba(X_test_dtm)
y_pred_prob

array([[  9.90372330e-01,   3.90616299e-03,   2.72871271e-03,
          2.99279391e-03],
       [  9.99653760e-01,   3.31956864e-04,   9.51675464e-06,
          4.76636444e-06],
       [  1.71223195e-02,   9.06308119e-01,   7.65681852e-02,
          1.37625207e-06],
       ..., 
       [  1.11287984e-14,   8.74415356e-16,   9.99829927e-01,
          1.70073001e-04],
       [  9.98733197e-01,   3.35495298e-05,   1.02210923e-04,
          1.13104276e-03],
       [  9.93369558e-01,   4.75501380e-03,   1.65050958e-04,
          1.71037767e-03]])

In [25]:
# calculate accuracy
metrics.accuracy_score(y_test, y_pred_class)

0.86048171275646745

In [26]:
# print the confusion matrix
metrics.confusion_matrix(y_test, y_pred_class)

array([[7612,  120,  109,   14],
       [ 478,  308,   71,   11],
       [ 318,  108, 1537,   61],
       [ 141,   19,  114,  189]])
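
The two confusion matrices can be compared in terms of direction of error: entries above the diagonal are apps predicted at a higher rating than labelled, entries below are predicted lower. A short sketch using the matrices printed in this notebook (variable names are illustrative):

```python
import numpy as np

# Confusion matrices copied from the Naive Bayes and logistic regression outputs.
cm_nb = np.array([
    [7540,  222,   53,   40],
    [ 499,  333,   24,   12],
    [ 351,  185, 1426,   62],
    [ 145,   36,  129,  153],
])
cm_lr = np.array([
    [7612,  120,  109,   14],
    [ 478,  308,   71,   11],
    [ 318,  108, 1537,   61],
    [ 141,   19,  114,  189],
])

summary = {}
for name, cm in [("naive_bayes", cm_nb), ("logreg", cm_lr)]:
    summary[name] = {
        "accuracy": np.trace(cm) / cm.sum(),
        "predicted_higher": int(np.triu(cm, k=1).sum()),  # above the diagonal
        "predicted_lower": int(np.tril(cm, k=-1).sum()),  # below the diagonal
    }
    print(name, summary[name])
```

Both models err mostly toward lower ratings, which is consistent with the dominance of the 4+ class in the training data.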

## Examining the "tokens"

In [27]:
# store the vocabulary of X_train
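# get_feature_names() was removed in scikit-learn 1.2; on older versions call vect.get_feature_names() instead
X_train_tokens = vect.get_feature_names_out()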
len(X_train_tokens)

346189

In [28]:
# number of times each token appears across all apps rated 4+
four_token_count = nb.feature_count_[0,:]

In [29]:
# number of times each token appears across all apps rated 9+
nine_token_count = nb.feature_count_[1,:]

In [30]:
# number of times each token appears across all apps rated 12+
twelve_token_count = nb.feature_count_[2,:]

In [31]:
# number of times each token appears across all apps rated 17+
seventeen_token_count = nb.feature_count_[3,:]

In [32]:
tokens = pd.DataFrame({'token':X_train_tokens,'four':four_token_count,
                      'nine':nine_token_count, 
                      'twelve':twelve_token_count,
                      'seventeen':seventeen_token_count})

In [33]:
tokens.to_csv("tokens in different classes.csv")
tokens

Unnamed: 0,four,nine,seventeen,token,twelve
0,52.0,3.0,3.0,00,9.0
1,15.0,0.0,0.0,00 00,1.0
2,4.0,0.0,2.0,00 000,4.0
3,2.0,0.0,0.0,00 limited,0.0
4,6.0,0.0,0.0,00 nov,0.0
5,2.0,0.0,0.0,00 oct,0.0
6,283.0,39.0,48.0,000,268.0
7,32.0,6.0,10.0,000 000,41.0
8,1.0,0.0,0.0,000 10,5.0
9,2.0,0.0,0.0,000 app,0.0


In [34]:
# derive the high ratio age-four tokens
tokens['ratio'] = tokens.four/(tokens.four+tokens.nine+tokens.twelve+tokens.seventeen)
age_four_tokens = tokens.sort_values(['ratio','four'],ascending=False)
age_four_tokens.to_csv("age_four_tokens.csv")

In [35]:
# derive the high ratio age-nine tokens
tokens['ratio'] = tokens.nine/(tokens.four+tokens.nine+tokens.twelve+tokens.seventeen)
age_nine_tokens = tokens.sort_values(['ratio','nine'],ascending=False)
age_nine_tokens.to_csv("age_nine_tokens.csv")

In [36]:
# derive the high ratio age-twelve tokens
tokens['ratio'] = tokens.twelve/(tokens.four+tokens.nine+tokens.twelve+tokens.seventeen)
age_twelve_tokens = tokens.sort_values(['ratio','twelve'],ascending=False)
age_twelve_tokens.to_csv("age_twelve_tokens.csv")

In [37]:
# derive the high ratio age-seventeen tokens
tokens['ratio'] = tokens.seventeen/(tokens.four+tokens.nine+tokens.twelve+tokens.seventeen)
age_seventeen_tokens = tokens.sort_values(['ratio','seventeen'],ascending=False)
age_seventeen_tokens.to_csv("age_seventeen_tokens.csv")
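
The four ratio cells above repeat the same computation with a different numerator. As a possible refactoring, the sketch below does the same ranking in a loop. The vocabulary and counts are hypothetical stand-ins for `X_train_tokens` and `nb.feature_count_`, and the CSV export is left out.

```python
import numpy as np
import pandas as pd

# Hypothetical vocabulary and per-class token counts (rows: 4+, 9+, 12+, 17+).
vocab = ["casino", "puzzle", "tank battle"]
feature_count = np.array([
    [ 1.0, 40.0, 2.0],
    [ 0.0,  5.0, 6.0],
    [ 9.0,  3.0, 4.0],
    [10.0,  2.0, 0.0],
])
class_names = ["four", "nine", "twelve", "seventeen"]

tokens = pd.DataFrame(feature_count.T, columns=class_names)
tokens.insert(0, "token", vocab)
# Total occurrences of each token across all four classes.
total = tokens[class_names].sum(axis=1)

top_tokens = {}
for name in class_names:
    # Share of the token's occurrences that fall in this class.
    ranked = tokens.assign(ratio=tokens[name] / total)
    ranked = ranked.sort_values(["ratio", name], ascending=False)
    top_tokens[name] = ranked.iloc[0]["token"]

print(top_tokens)
```

Because the ratio ignores how many apps each class contains, tokens tend to score higher for larger classes; dividing each count by its class total first is one alternative worth considering.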