## Exploratory Data Analysis for Play Store and App Store apps

The Play Store and the App Store host millions of applications, which are downloaded by billions of people across the world. Data about these applications can help us better understand what people love most and spend their time on, what their daily needs are, and where their habits are heading. Such insight leads not only to better products but also to better ideas for products.

This notebook focuses on data cleaning and exploration using Python:

#### Pretty-print data rows

In [1]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # print() adds its own newline too, so this leaves two blank lines after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [2]:
from csv import reader

In [3]:
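# Open each file in a with-block so it is closed after reading
with open('AppleStore.csv', encoding='utf8', newline='') as f:
    apps_data = list(reader(f))
with open('googleplaystore.csv', encoding='utf8', newline='') as f:
    plays_data = list(reader(f))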

#### App Store information provided in the dataset

In [4]:
apps_data[0]

['id',
 'track_name',
 'size_bytes',
 'currency',
 'price',
 'rating_count_tot',
 'rating_count_ver',
 'user_rating',
 'user_rating_ver',
 'ver',
 'cont_rating',
 'prime_genre',
 'sup_devices.num',
 'ipadSc_urls.num',
 'lang.num',
 'vpp_lic']

#### Google Play Store information provided in the dataset

In [5]:
plays_data[0]

['App',
 'Category',
 'Rating',
 'Reviews',
 'Size',
 'Installs',
 'Type',
 'Price',
 'Content Rating',
 'Genres',
 'Last Updated',
 'Current Ver',
 'Android Ver']

In [6]:
explore_data(apps_data,1,5,True)

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


Number of rows: 7198
Number of columns: 16


In [7]:
explore_data(plays_data,1,5,True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 10842
Number of columns: 13


In [8]:
explore_data(plays_data,10473,10474,False)

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']




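#### Removing anomalies

The row at index 10473 has only 12 fields instead of 13: the category is missing, so every later value is shifted one column to the left (the rating `'1.9'` sits in the category column). The row is deleted if its length does not match the header.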

In [9]:
if len(plays_data[10473]) != len(plays_data[0]):
    del plays_data[10473]
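
The cell above checks a single, known index. As a rough sketch of how the same length check could be applied to every row, the helper below scans the whole dataset. The name `find_malformed_rows` and the sample rows are made up for illustration and are not part of the original notebook.

```python
def find_malformed_rows(dataset):
    """Return (index, row) pairs whose length differs from the header row."""
    expected = len(dataset[0])
    return [(i, row) for i, row in enumerate(dataset) if len(row) != expected]


# Tiny made-up sample: the last row is missing one field.
sample = [
    ['App', 'Category', 'Rating'],
    ['A', 'TOOLS', '4.1'],
    ['B', 'GAME', '3.9'],
    ['C', '1.9'],
]

malformed = find_malformed_rows(sample)
print(malformed)  # [(3, ['C', '1.9'])]

# Delete from the end so that earlier indices stay valid.
cleaned = list(sample)
for index, _ in reversed(malformed):
    del cleaned[index]
```

Deleting in reverse order matters when more than one row is malformed, because removing a row shifts the index of every row after it.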

#### Duplicate apps registered on the Play Store

In [10]:
duplicate_apps = []
unique_apps = []

for app in plays_data[1:]:
    name = app[0]
    if name in unique_apps:
        if name not in duplicate_apps:
            duplicate_apps.append(name)
    else:
        unique_apps.append(name) 

In [11]:
len(duplicate_apps)

798
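
The loop above tests membership in two lists, which gets slow as the lists grow. One possible alternative, assuming only the standard library, is `collections.Counter`. The function name `count_duplicates` and the sample rows are hypothetical.

```python
from collections import Counter


def count_duplicates(dataset, name_index=0):
    """Map each app name that appears more than once to its number of rows."""
    counts = Counter(row[name_index] for row in dataset[1:])  # skip the header
    return {name: n for name, n in counts.items() if n > 1}


# Made-up sample with one app listed three times.
sample = [
    ['App', 'Reviews'],
    ['Alpha', '10'],
    ['Beta', '5'],
    ['Alpha', '12'],
    ['Alpha', '12'],
]

duplicates = count_duplicates(sample)
print(duplicates)       # {'Alpha': 3}
print(len(duplicates))  # 1 app name has duplicate rows
```

Besides the number of duplicated names, this also reports how many rows each duplicated app has.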

#### Sample rows for a duplicate app
The data contains duplicate rows for the same app. They need to be removed because they would otherwise bias the analysis.

In [12]:
for app in plays_data[1:]:
    if app[0] == duplicate_apps[0]:
        print(app)

['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2', '80805', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'February 26, 2018', 'Varies with device', '4.0.3 and up']
['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2', '80805', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'February 26, 2018', 'Varies with device', '4.0.3 and up']
['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2', '80804', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'February 26, 2018', 'Varies with device', '4.0.3 and up']


In [13]:
# Max Reviews for unique app name
reviews_max = {}
for app in plays_data[1:]:
    name = app[0]
    n_reviews = float(app[3])
    if name in reviews_max:
        reviews_max[name] = max(reviews_max[name],n_reviews)
    else:
        reviews_max[name] = n_reviews

#### Removing duplicate apps
Duplicate rows are removed. For each app, only the row with the highest number of reviews is kept.

In [14]:
android_clean = []
already_added = []
android_clean.append(plays_data[0])
for app in plays_data[1:]:
    name = app[0]
    n_review = float(app[3])
    if n_review == reviews_max[name] and name not in already_added:
        android_clean.append(app)
        already_added.append(name)

In [15]:
len(android_clean)

9660
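
The two cells above (building `reviews_max`, then filtering) can also be folded into a single pass. The following is only a sketch of that idea; `keep_most_reviewed` and the sample data are assumptions for illustration. Unlike the original, it orders the result by the first appearance of each app name rather than by the position of the kept row.

```python
def keep_most_reviewed(dataset, name_index=0, reviews_index=3):
    """Keep one row per app name: the first row with the highest review count."""
    best = {}  # name -> (reviews, row); dicts preserve insertion order
    for row in dataset[1:]:
        name = row[name_index]
        reviews = float(row[reviews_index])
        # Strict '>' keeps the earliest row when review counts tie.
        if name not in best or reviews > best[name][0]:
            best[name] = (reviews, row)
    return [dataset[0]] + [row for _, row in best.values()]


# Made-up sample: 'Alpha' appears three times with different review counts.
sample = [
    ['App', 'Category', 'Rating', 'Reviews'],
    ['Alpha', 'X', '4.0', '10'],
    ['Beta', 'Y', '3.0', '5'],
    ['Alpha', 'X', '4.0', '12'],
    ['Alpha', 'X', '4.1', '12'],
]

deduped = keep_most_reviewed(sample)
for row in deduped:
    print(row)
```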

In [16]:
# Heuristic: treat a string as English if it has at most 3 non-ASCII characters
def check_english(string):
    c = 0
    for ch in string:
        if ord(ch) > 127:
            c += 1
    if c > 3:
        return False
    return True

In [17]:
check_english('Instachat 😜')

True

In [18]:
check_english('Docs To Go™ Free Office Suite')

True

In [19]:
check_english('爱奇艺PPS -《欢乐颂2》电视剧热播')

False

In [20]:
check_english('Instagram')

True
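
The threshold of 3 in `check_english` is a tolerance for names that contain a few emoji or symbols such as `™`. A minimal sketch with the tolerance exposed as a parameter is shown below. The name `is_probably_english` and the parameter `max_non_ascii` are hypothetical, and the check remains a heuristic rather than real language detection.

```python
def is_probably_english(name, max_non_ascii=3):
    """Return True if the name has at most max_non_ascii non-ASCII characters."""
    non_ascii = sum(1 for ch in name if ord(ch) > 127)
    return non_ascii <= max_non_ascii


# The same names that were tried with check_english above.
names = [
    'Instachat 😜',
    'Docs To Go™ Free Office Suite',
    '爱奇艺PPS -《欢乐颂2》电视剧热播',
    'Instagram',
]
results = [is_probably_english(name) for name in names]
print(results)  # [True, True, False, True]

# A stricter tolerance of 0 rejects any name with a non-ASCII character.
strict = [is_probably_english(name, max_non_ascii=0) for name in names]
print(strict)  # [False, False, False, True]
```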

### Data Analysis

##### 1. Number of English-language apps in the dataset

In [21]:
def get_lang_updated_dataset(dataset):
    new_dataset = []
    new_dataset.append(dataset[0])
    for app in dataset[1:]:
        if check_english(app[0]):
            new_dataset.append(app)
    return new_dataset

In [22]:
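# Column 0 is the app name only for the Play Store; for the App Store it is the
# numeric 'id' ('track_name' is column 1), so every App Store row passes the check.
updated_apps_data = get_lang_updated_dataset(apps_data)
updated_plays_data = get_lang_updated_dataset(plays_data)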

In [23]:
len(updated_apps_data)

7198

In [24]:
len(updated_plays_data)

10796

##### 2. Number of free apps on the Play Store and the App Store

In [25]:
def get_free_updated_dataset(dataset,idx):
    new_dataset = []
    new_dataset.append(dataset[0])
    for app in dataset[1:]:
        try:
            price = float(app[idx])
            if price == 0.0:
                new_dataset.append(app)
        except ValueError:
            # the price is not a plain number, so the row is not counted as free
            pass
    return new_dataset

In [26]:
updated_apps_data = get_free_updated_dataset(apps_data,4)
updated_plays_data = get_free_updated_dataset(plays_data,7)

In [27]:
len(updated_apps_data)

4057

In [28]:
len(updated_plays_data)

10041
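
`get_free_updated_dataset` silently skips every row whose price cannot be converted with `float`, so paid apps with a currency symbol and genuinely malformed rows are treated the same way. The sketch below separates the two cases. It assumes paid prices are written with a leading `$` (for example `'$4.99'`), which is an assumption about the Play Store file and is not shown in the rows above; `parse_price` is a hypothetical helper.

```python
def parse_price(raw):
    """Convert a price string to a float, or return None if it is not a price."""
    try:
        # Assumption: paid prices carry a leading dollar sign, e.g. '$4.99'.
        return float(raw.strip().lstrip('$'))
    except ValueError:
        return None


# Made-up price column values, including one that is not a price at all.
prices = ['0', '$4.99', 'Everyone', '0.0']
parsed = [parse_price(p) for p in prices]
print(parsed)  # [0.0, 4.99, None, 0.0]

free = sum(1 for p in parsed if p == 0.0)
paid = sum(1 for p in parsed if p is not None and p > 0.0)
unparsed = sum(1 for p in parsed if p is None)
print(free, paid, unparsed)  # 2 1 1
```

Counting the unparsed values makes it visible when a row is being dropped because of bad data rather than because the app is paid.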

In [29]:
def freq_table(dataset,index):
    table = {}
    for app in dataset[1:]:
        val = app[index]
        if val in table:
            table[val] += 1
        else:
            table[val] = 1
    for k in table:
        table[k] = (table[k]*100)/len(dataset[1:])
    return table

In [30]:
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
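
`freq_table` and `display_table` together count values, convert the counts to percentages, and print them in descending order. As a hedged alternative, the same steps can be sketched with `collections.Counter`. The name `percentage_table` and the sample rows are made up; note that ties are ordered by first appearance here, whereas the tuple sort in `display_table` orders ties by key in reverse.

```python
from collections import Counter


def percentage_table(dataset, index):
    """Return (value, percentage) pairs for one column, most frequent first."""
    counts = Counter(row[index] for row in dataset[1:])  # skip the header
    total = sum(counts.values())
    return [(value, 100 * n / total) for value, n in counts.most_common()]


# Made-up sample with four apps in three categories.
sample = [
    ['App', 'Category'],
    ['A', 'GAME'],
    ['B', 'TOOLS'],
    ['C', 'GAME'],
    ['D', 'FAMILY'],
]

table = percentage_table(sample, 1)
for value, pct in table:
    print(value, ':', pct)
```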

##### 3. Percentage of Apps on App Store based on Prime Genre

In [31]:
display_table(apps_data,11)

Games : 53.66124774211477
Entertainment : 7.433652910935112
Education : 6.294289287203001
Photo & Video : 4.849242740030569
Utilities : 3.445880227872725
Health & Fitness : 2.501042100875365
Productivity : 2.473252744198972
Social Networking : 2.3204112824788106
Lifestyle : 2.0008336807002918
Music : 1.917465610671113
Shopping : 1.6951507572599693
Sports : 1.5839933305543976
Book : 1.5562039738780047
Finance : 1.445046547172433
Travel : 1.1254689453939142
News : 1.0421008753647354
Weather : 1.0004168403501459
Reference : 0.8892594136445742
Food & Drink : 0.8753647353063777
Business : 0.7919966652771988
Navigation : 0.6391552035570377
Medical : 0.31957760177851885
Catalogs : 0.1389467833819647


##### 4. Percentage of Apps on Play Store based on Category

In [32]:
display_table(plays_data,1)

FAMILY : 18.19188191881919
GAME : 10.55350553505535
TOOLS : 7.776752767527675
MEDICAL : 4.271217712177122
BUSINESS : 4.243542435424354
PRODUCTIVITY : 3.911439114391144
PERSONALIZATION : 3.6162361623616235
COMMUNICATION : 3.5701107011070112
SPORTS : 3.5424354243542435
LIFESTYLE : 3.5239852398523985
FINANCE : 3.376383763837638
HEALTH_AND_FITNESS : 3.1457564575645756
PHOTOGRAPHY : 3.0904059040590406
SOCIAL : 2.7214022140221403
NEWS_AND_MAGAZINES : 2.61070110701107
SHOPPING : 2.3985239852398523
TRAVEL_AND_LOCAL : 2.3800738007380073
DATING : 2.158671586715867
BOOKS_AND_REFERENCE : 2.1309963099630997
VIDEO_PLAYERS : 1.6143911439114391
EDUCATION : 1.4391143911439115
ENTERTAINMENT : 1.3745387453874538
MAPS_AND_NAVIGATION : 1.2638376383763839
FOOD_AND_DRINK : 1.1715867158671587
HOUSE_AND_HOME : 0.8118081180811808
LIBRARIES_AND_DEMO : 0.7841328413284133
AUTO_AND_VEHICLES : 0.7841328413284133
WEATHER : 0.7564575645756457
ART_AND_DESIGN : 0.5996309963099631
EVENTS : 0.5904059040590406
PARENTING : 

##### 5. Percentage of Apps on Play Store based on Genre

In [33]:
display_table(plays_data,9)

Tools : 7.767527675276753
Entertainment : 5.747232472324724
Education : 5.064575645756458
Medical : 4.271217712177122
Business : 4.243542435424354
Productivity : 3.911439114391144
Sports : 3.6715867158671585
Personalization : 3.6162361623616235
Communication : 3.5701107011070112
Lifestyle : 3.514760147601476
Finance : 3.376383763837638
Action : 3.367158671586716
Health & Fitness : 3.1457564575645756
Photography : 3.0904059040590406
Social : 2.7214022140221403
News & Magazines : 2.61070110701107
Shopping : 2.3985239852398523
Travel & Local : 2.370848708487085
Dating : 2.158671586715867
Books & Reference : 2.1309963099630997
Arcade : 2.029520295202952
Simulation : 1.845018450184502
Casual : 1.7804428044280443
Video Players & Editors : 1.5959409594095941
Puzzle : 1.2915129151291513
Maps & Navigation : 1.2638376383763839
Food & Drink : 1.1715867158671587
Role Playing : 1.0055350553505535
Strategy : 0.9870848708487084
Racing : 0.9040590405904059
House & Home : 0.8118081180811808
Libraries &

##### 6. Average User Rating for apps of each genre on App Store

In [34]:
unique_genre = freq_table(apps_data,11).keys()
for genre in unique_genre:
    total = 0
    len_genre = 0
    for app in apps_data[1:]:
        genre_app = app[11]
        if genre_app == genre:
            total += float(app[7])
            len_genre += 1
    average = total/len_genre
    print(genre)
    print(average)

Social Networking
2.9850299401197606
Photo & Video
3.8008595988538683
Games
3.6850077679958573
Music
3.9782608695652173
Reference
3.453125
Health & Fitness
3.7
Weather
3.5972222222222223
Utilities
3.278225806451613
Travel
3.376543209876543
Shopping
3.540983606557377
News
2.98
Navigation
2.6847826086956523
Lifestyle
2.8055555555555554
Entertainment
3.2467289719626167
Food & Drink
3.1825396825396823
Sports
2.982456140350877
Book
2.4776785714285716
Finance
2.4326923076923075
Education
3.376379690949227
Productivity
4.00561797752809
Business
3.745614035087719
Catalogs
2.1
Medical
3.369565217391304


##### 7. Average User Rating for apps of each category on Play Store

In [35]:
cat_freq_table = freq_table(plays_data,1)
for category in cat_freq_table:
    total = 0
    len_category = 0
    for app in plays_data[1:]:
        category_app = app[1]
        if category_app == category:
            rating = app[2]
            if rating != 'NaN':
                rating = float(rating)
                total += rating
                len_category += 1
    average = total/len_category
    print(category)
    print(average)

ART_AND_DESIGN
4.358064516129031
AUTO_AND_VEHICLES
4.19041095890411
BEAUTY
4.278571428571428
BOOKS_AND_REFERENCE
4.346067415730338
BUSINESS
4.121452145214522
COMICS
4.155172413793104
COMMUNICATION
4.158536585365852
DATING
3.9707692307692306
EDUCATION
4.389032258064517
ENTERTAINMENT
4.126174496644294
EVENTS
4.435555555555557
FINANCE
4.131888544891644
FOOD_AND_DRINK
4.1669724770642205
HEALTH_AND_FITNESS
4.2771043771043775
HOUSE_AND_HOME
4.197368421052633
LIBRARIES_AND_DEMO
4.178461538461538
LIFESTYLE
4.094904458598724
GAME
4.2863263445761195
FAMILY
4.192272467086437
MEDICAL
4.18914285714286
SOCIAL
4.255598455598457
SHOPPING
4.259663865546221
PHOTOGRAPHY
4.192113564668767
SPORTS
4.223510971786835
TRAVEL_AND_LOCAL
4.10929203539823
TOOLS
4.047411444141691
PERSONALIZATION
4.335987261146501
PRODUCTIVITY
4.211396011396012
PARENTING
4.300000000000001
WEATHER
4.243999999999999
VIDEO_PLAYERS
4.063750000000001
NEWS_AND_MAGAZINES
4.1321888412017165
MAPS_AND_NAVIGATION
4.051612903225806
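
Both average-rating cells loop over the full dataset once per genre or category. A single pass that accumulates a sum and a count per group gives the same averages. The sketch below is one way to write it; `average_by_group`, its `missing` parameter, and the sample rows are assumptions for illustration.

```python
from collections import defaultdict


def average_by_group(dataset, group_index, value_index, missing=('NaN', '')):
    """Average a numeric column per group, skipping values listed in missing."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in dataset[1:]:  # skip the header
        raw = row[value_index]
        if raw in missing:
            continue
        totals[row[group_index]] += float(raw)
        counts[row[group_index]] += 1
    # Groups with no usable value never enter totals, so no division by zero.
    return {group: totals[group] / counts[group] for group in totals}


# Made-up sample in the Play Store column order (App, Category, Rating).
sample = [
    ['App', 'Category', 'Rating'],
    ['A', 'TOOLS', '4.0'],
    ['B', 'TOOLS', '3.0'],
    ['C', 'GAME', 'NaN'],
    ['D', 'GAME', '5.0'],
]

averages = average_by_group(sample, group_index=1, value_index=2)
print(averages)  # {'TOOLS': 3.5, 'GAME': 5.0}
```

With the notebook's data this would be called as `average_by_group(apps_data, 11, 7)` for the App Store and `average_by_group(plays_data, 1, 2)` for the Play Store.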
