# Suspicious Apps Detection - 10 Nearest Neighbor Approach

<br />Mengchuan (Mike) Fu mfu10@fordham.edu February 2017 (updated 3/30/2017)<br /><br />
For the Mobile Safety Research<br /><br />

## Objective

The main objective is to test whether a k-nearest-neighbor (k-NN) method is feasible for the maturity rating research, and to identify suspicious apps for use in a future MTurk survey.<br /><br />
What this notebook does:<br />
1. Import the data, examine the shape and distribution
2. Random selection of a subset of apps
3. Data preprocessing: tokenize with a regular expression, lowercase, remove stop words
4. Count the number of overlapping words in the descriptions for each pair of apps in the selected dataset
5. For each app, select the 10 apps whose descriptions share the most words with it
6. Predict the true maturity rating of each app from the ratings of its top 10 apps (majority voting)
7. Further analysis

## Dataset Overview 

44,840 app records with title, description, and maturity rating<br />
Crawled from Apple App Store

## Exploring the Dataset - Pandas


We'll use various Python toolkits in this notebook. <br /><br />
To start, we load:<br />
- pandas: dataframe data structure based on the one in R, along with vectorized routines for fast manipulation of the data<br />
- numpy: various math tools, especially for constructing and working with multidimensional arrays.<br />
- nltk: toolkit for building Python programs to work with human language data; “an amazing library to play with natural language.”<br />

In [1]:
import nltk
import pandas as pd
from nltk.corpus import stopwords
import string
import numpy as np

In [2]:
# read data
df = pd.read_csv('10-nn dataset.csv')

In [3]:
# examine the shape
df.shape

(44840, 3)

In [4]:
# examine the class distribution
df.Age.value_counts()

4     31486
12     8026
9      3479
17     1849
Name: Age, dtype: int64
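
The class distribution is heavily skewed toward the 4+ rating, which matters when judging any accuracy figure later on. As a rough sketch, the counts printed above can be turned into class shares and a majority-class baseline (the accuracy of always predicting the most common rating). The variable names here are illustrative and not part of the original notebook.

```python
# Class counts copied from the value_counts() output above.
counts = {4: 31486, 12: 8026, 9: 3479, 17: 1849}

total = sum(counts.values())

# Share of each rating in the full dataset.
shares = {rating: n / total for rating, n in counts.items()}

# Accuracy of a trivial classifier that always predicts the most common rating.
baseline_accuracy = max(shares.values())

for rating, share in sorted(shares.items()):
    print(f"rating {rating:>2}: {share:.3f}")
print(f"majority-class baseline: {baseline_accuracy:.3f}")
```

Any classifier on this data should be compared against a baseline of about 0.70 rather than against chance.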

In [5]:
# preview the first 10 rows
df.head(10)

Unnamed: 0,name,Description,Age
0,'A A 777 My Slots Machines Rich FREE',Download the best Slot experience for free to...,12
1,'A A 777 My Slots Machines Vegas Casino',Download the best Slot experience for free to...,12
2,'A A 777 My Slots Rich Casino Amazing',Download the best Slot experience for free to...,12
3,'A A 777 My Slots Rich Casino Vegas',Download the best Slot experience for free to...,12
4,'A A 777 My Vegas Casino Slots FREE',Download the best Slot experience for free to...,12
5,'A A 777 My Vegas Classic Slots Machines FREE',Download the best Slot experience for free to...,12
6,'A A A 777 Abing',Download the best Slot experience for free to...,12
7,'A A a 777 abingo Casino',Download the best Slot experience for free to...,12
8,'A a A 777 ABluto Coins',Download the best Slot experience for free to...,12
9,'A A A 777 Abuse Casino Vegas',Download the best Slot experience for free to...,12


## Random Selection

In [6]:
import random

# randomly select `size` apps from the dataset (no seed is set, so each run draws a different sample)
X = df.Description
size = 10000
ran = random.sample(range(0, len(X)), size)
# ran

In [7]:
df1 = df.loc[ran]
df1.index = list(range(0, size))
X = df1.Description
df1.head(10)

Unnamed: 0,name,Description,Age
0,'AA SKATE GAMES: Street Sessions 2012',"Test your skill, speed, stamina, and mind in ...",12
1,'Cake Pop Maker Free - Dessert & Fruit Decoart...,"Cup cakes, chocolates, ice cream and other ba...",4
2,'Fallen Shadows: Coming Home - A Hidden Object...,A family member has gone missing in New Orlea...,9
3,'Edna & Harvey: The Breakout',The award-winning debut adventure game from t...,9
4,'Dazzling Night',Let's make the flower bloom at night when the ...,4
5,'Tangled 3d 2048 super cool brain teasing game',"Many version of 2048 games are created, howeve...",4
6,'Dark Shadow of Liberty HD',*** $4.99 -> $2.99 TODAY ONLY! GET THE MOST V...,4
7,'R.P.S.25',"How to playJoin panels, in the order of Rock-...",4
8,'Quick Spell',"Challenge your friends in this fast paced, ad...",4
9,'Water Bottle Slip Away Talent Show- Best Chal...,Now you Can have fun even more by doing the f...,4
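
Because the sample above is drawn without a seed, rerunning the notebook yields a different subset and different downstream numbers. A minimal sketch of a reproducible draw follows, assuming a fixed seed is acceptable for this study. The helper name `sample_indices` and the seed value are hypothetical.

```python
import random

def sample_indices(n_rows, size, seed):
    """Draw `size` distinct row positions from range(n_rows), reproducibly."""
    rng = random.Random(seed)  # private generator; leaves the global random state alone
    return rng.sample(range(n_rows), size)

# Toy sizes for illustration; the notebook uses 44840 rows and size=10000.
first = sample_indices(n_rows=100, size=10, seed=2017)
second = sample_indices(n_rows=100, size=10, seed=2017)
print(first == second)  # True: the same seed gives the same sample
```

With pandas, `df.sample(n=size, random_state=...)` followed by `reset_index(drop=True)` should achieve the same effect in one step.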


## Data Preprocessing

In [8]:
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer

# Build the tokenizer and the stop-word set once instead of on every call
tokenizer = RegexpTokenizer(r'\w+')
stop_words = set(stopwords.words("english"))

# Define the data preprocessing function: regular expression, lowercase, remove stop words
def prep(des):
    tokens = tokenizer.tokenize(des)       # split on runs of word characters
    tokens = [w.lower() for w in tokens]   # lowercase every token
    return [w for w in tokens if w not in stop_words]  # drop English stop words

In [9]:
# preprocess description data
Y = []
for i in range(0, size):
    Y.append(prep(X[i]))
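
To make the effect of the preprocessing concrete, here is a dependency-free sketch of the same three steps applied to one description. It assumes `re.findall(r"\w+", ...)` as a stand-in for NLTK's `RegexpTokenizer(r'\w+')` and uses a tiny hand-picked stop-word set instead of NLTK's full English list, so results on real text will differ from `prep`.

```python
import re

# Tiny illustrative stop-word list; the notebook uses NLTK's full English list.
TOY_STOP_WORDS = {"the", "for", "to", "a", "and", "of"}

def prep_sketch(description):
    """Tokenize on runs of word characters, lowercase, drop stop words."""
    tokens = re.findall(r"\w+", description)
    tokens = [t.lower() for t in tokens]
    return [t for t in tokens if t not in TOY_STOP_WORDS]

print(prep_sketch("Download the best Slot experience for free!"))
# ['download', 'best', 'slot', 'experience', 'free']
```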

## 10 Nearest Neighbor 

In [10]:
# Count the number of overlapping words in the descriptions for each pair of apps in the selected dataset
# Note: the pair (j, j) is included, so each app's overlap with itself sits on the diagonal
Y_sets = [set(words) for words in Y]  # build each word set once
Y_count = []
for j in range(0, size):
    Y_count.append([len(Y_sets[j] & Y_sets[i]) for i in range(0, size)])
# Y_count
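
The pairwise loop above performs `size * size` set intersections in pure Python, which is slow for 10,000 apps. One possible alternative, sketched here on toy data, encodes each description as a binary word-presence vector so that a single matrix product yields every pairwise overlap count. This is an assumption about how the step could be sped up, not what the notebook does; `docs`, `vocab`, and `B` are illustrative names.

```python
import numpy as np

# Toy tokenized descriptions standing in for Y.
docs = [
    ["slots", "casino", "free", "vegas"],
    ["casino", "vegas", "coins"],
    ["math", "kids", "free"],
]

# Map every distinct word to a column index.
vocab = {w: i for i, w in enumerate(sorted({w for d in docs for w in d}))}

# Binary document-term matrix: B[d, w] = 1 if word w occurs in document d.
B = np.zeros((len(docs), len(vocab)), dtype=np.int32)
for row, doc in enumerate(docs):
    for w in set(doc):
        B[row, vocab[w]] = 1

# Entry (i, j) of B @ B.T counts the distinct words shared by documents i and j.
overlap = B @ B.T
print(overlap)
# [[4 2 1]
#  [2 3 0]
#  [1 0 3]]
```

For the full sample a dense matrix of this kind may be too large for memory, so a sparse representation would likely be needed.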

In [11]:
# len(Y_count)

In [12]:
# Generate 10-nn index for each app
ind = []
for i in range(0,size):
    ind.append(np.argpartition(Y_count[i], -10)[-10:])
#ind
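
Two properties of this step are easy to miss: `np.argpartition` returns the top-k indices in no particular order, and each app's own index is normally among them, because its overlap with itself is at least as large as with any other app. The toy sketch below illustrates both. Masking out the app's own index is shown only as one option and is not done in the notebook.

```python
import numpy as np

counts_row = np.array([7, 2, 9, 4, 9, 1])  # toy overlap counts for one app
k = 3

# argpartition guarantees the last k positions hold the k largest values,
# but in no particular order.
top_k = np.argpartition(counts_row, -k)[-k:]

# Sort those k indices by descending overlap for a ranked neighbor list.
ranked = top_k[np.argsort(counts_row[top_k])[::-1]]
print(sorted(top_k.tolist()))  # [0, 2, 4]
print(counts_row[ranked])      # [9 9 7]

# Optional: exclude the app itself (here assumed to be index 2) before selecting.
self_index = 2
masked = counts_row.copy()
masked[self_index] = -1  # real overlap counts are never negative
neighbors = np.argpartition(masked, -k)[-k:]
print(sorted(neighbors.tolist()))  # [0, 3, 4]
```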

In [13]:
# df1.Age[ind[2]]

## True Rating Prediction

In [14]:
# Define majority voting function

def find_majority(k):
    myMap = {}
    maximum = ( '', 0 ) # (occurring element, occurrences)
    for n in k:
        if n in myMap: myMap[n] += 1
        else: myMap[n] = 1

        # Keep track of maximum on the go
        if myMap[n] > maximum[1]: maximum = (n,myMap[n])

    return maximum
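
For comparison, the same vote can be sketched with `collections.Counter` from the standard library. The function name `majority_vote` is illustrative. Note that tie handling differs from `find_majority`: `Counter.most_common` favors the label seen first, while `find_majority` favors the label that first reaches the winning count.

```python
from collections import Counter

def majority_vote(labels):
    """Return (label, count) for the most frequent label.

    Ties go to the label that appears first in `labels`
    (Counter preserves insertion order on Python 3.7+).
    """
    return Counter(labels).most_common(1)[0]

neighbor_ratings = [4, 12, 4, 9, 4, 12, 4, 17, 4, 4]
print(majority_vote(neighbor_ratings))  # (4, 6)
```

With 10 neighbors and four classes, ties are possible, so the choice of tie-breaking rule can change a small number of predictions.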

In [15]:
# Predict the true rating for each app
predict = []
for i in range(0,len(Y)):
    predict.append(find_majority(df1.Age[ind[i]])[0])
# len(predict)

In [16]:
# Append the result to our dataframe
df1['predict'] = predict

In [17]:
# df1.to_csv('10-nn prediction.csv')

In [18]:
# t=pd.read_csv('10-nn prediction.csv')

In [20]:
# Compute accuracy
# sum(t.Age ==t.predict)

In [37]:
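# rows where the stated rating differs from the 10-NN prediction
# (uses df1 directly, since the cell that loads `t` is commented out above)
df2 = df1.loc[df1.Age != df1.predict]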
df2.head(10)

Unnamed: 0,name,Description,Age,predict
20,'Dasher Dan PRO - Zombie Monkey Island',Escape from Zombie Monkey Island! Only Dasher...,9,12
23,'Da Vinci Pinball',Answering the calls for another stellar pinbal...,4,12
28,'Danger Dodgers',Danger Dodgers - Endless End Of The WorldChoos...,9,4
35,'DARK - STAR',Dark Star is an epic arcade shoot-em-up in de...,12,9
36,'Easy Math For Kids - Free',Easy Math For KidsHow fast can you count?Bett...,4,12
48,'Dark Chess - The Way of Kings',"If you like to play Chinese Dark Chess, you h...",4,12
51,'Fall Down 100',"It's an easy and excited game, drive your pilo...",9,4
55,'Camel Ride Halloween',Enjoy an adventure with Pakalu Papito and Cam...,9,4
61,'Cali CowChat',Things just sound better (and funnier) coming...,17,4
72,'Cabola Monster Pixel Adventure',Clear as many monsters out of the town as you ...,9,4


In [38]:
df2.Age.value_counts()

9     392
12    388
17    181
4     142
Name: Age, dtype: int64

In [39]:
df1.Age.value_counts()

4     6642
12    2441
9      620
17     297
Name: Age, dtype: int64
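
The two outputs above can be combined into an overall agreement rate and a per-class disagreement rate. The sketch below hard-codes the printed counts and assumes both outputs come from the same run of the notebook. The dictionary names are illustrative.

```python
# Counts copied from the two value_counts() outputs above.
misclassified = {9: 392, 12: 388, 17: 181, 4: 142}
sampled = {4: 6642, 12: 2441, 9: 620, 17: 297}

total_wrong = sum(misclassified.values())
total = sum(sampled.values())
accuracy = 1 - total_wrong / total
print(f"overall agreement: {accuracy:.4f}")

# Share of each rating class whose stated rating differs from the 10-NN vote.
rates = {rating: misclassified[rating] / sampled[rating] for rating in sampled}
for rating in sorted(rates):
    print(f"rating {rating:>2}: {rates[rating]:.1%} disagree with the 10-NN vote")
```

Under that assumption, overall agreement is about 89%, but it is driven by the dominant 4+ class (roughly 2% disagreement). About 63% of 9+ apps and 61% of 17+ apps receive a different predicted rating, so those disagreements reflect class imbalance as much as genuinely suspicious ratings.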