# Mobile Apps for Google Play and App Store
This project aims to show developers which types of apps attract the most users.

My goal as a data analyst in this project is to make it easy to see what users demand.

In [13]:
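from csv import reader

# Load the App Store data set as a list of rows (the first row is the header).
opened_data = open('AppleStore.csv', encoding='utf8')
read_file = reader(opened_data)
apple_data = list(read_file)
opened_data.close()

# Load the Google Play data set the same way.
opened_data2 = open('googleplaystore.csv', encoding='utf8')
read_file2 = reader(opened_data2)
google_data = list(read_file2)
opened_data2.close()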

def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))
        

Above, we opened the two data sets and then wrote a function named *explore_data* to explore them. The function takes four parameters: the data set, the starting row index, the ending row index (exclusive), and an optional flag, *rows_and_columns*, that prints the number of rows and columns when set to True.
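The loading step can also be wrapped in a small helper that closes the file automatically. This is a sketch rather than the notebook's own code: `load_csv` is a hypothetical name, and UTF-8 encoding is an assumption about the two files.

```python
from csv import reader

def load_csv(path, encoding='utf8'):
    """Read a CSV file and return its rows as a list of lists (header included)."""
    # newline='' is the setting the csv module documentation recommends.
    with open(path, encoding=encoding, newline='') as f:
        return list(reader(f))

# Hypothetical usage with the two files from this project:
# apple_data = load_csv('AppleStore.csv')
# google_data = load_csv('googleplaystore.csv')
```

Because the header stays at index 0, the rest of the notebook (for example `google_data[1:]`) would work unchanged with lists loaded this way.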

Below we see the first two rows of the Apple Store dataset: the header and the first app. The data set includes 7198 rows (counting the header) and 16 columns. For more details about this dataset, you can check [here](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps).


In [18]:
explore_data(apple_data,0,2,True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


Number of rows: 7198
Number of columns: 16


The first two rows of the Google Play Store dataset are shown below. The dataset includes 10842 rows (counting the header) and 13 columns. For more details about this dataset, you can check [here](https://www.kaggle.com/lava18/google-play-store-apps).

In [17]:
explore_data(google_data,0,2, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


Number of rows: 10842
Number of columns: 13


## Data Cleaning
Now we need to make sure that our data is accurate. To do this, we will:
1. Detect inaccurate data, and correct or remove it.
2. Detect duplicate data, and remove the duplicates.

We read the discussions about this dataset on the website we took it from and saw that one row contains wrong information: the row at index 10473. We show that row below. Its category value is missing, so it has 12 fields instead of 13 and every later value is shifted one column to the left. We then **delete the row**.

In [19]:
print(google_data[10473])

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


In [21]:
del google_data[10473]  # run this cell only once; re-running it deletes another row

In [24]:
explore_data(google_data,0,2,True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


Number of rows: 10841
Number of columns: 13


As we can see above, after the row deletion, 10841 rows (including the header) are left in the Google dataset.
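The deleted row was shorter than the header, which suggests a check that does not depend on the discussion forum. A minimal sketch follows; `find_malformed_rows` is a hypothetical helper, and the sample data is made up for illustration.

```python
def find_malformed_rows(dataset):
    """Return (index, row) pairs whose length differs from the header's."""
    expected = len(dataset[0])
    return [(i, row) for i, row in enumerate(dataset) if len(row) != expected]

# Made-up sample: the last row is missing one field.
sample = [
    ['App', 'Category', 'Rating'],
    ['Foo', 'TOOLS', '4.1'],
    ['Bar', '1.9'],
]
print(find_malformed_rows(sample))  # [(2, ['Bar', '1.9'])]
```

Run on `google_data` before the deletion, this check should flag row 10473; any other index it reports would deserve the same inspection.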

We didn't see anything in the discussion section of the Apple dataset about rows with wrong data.

We also noticed that there are duplicate entries. Below are two examples from the Google dataset.

In [25]:
for app in google_data:
    name=app[0]
    if name=='Instagram':
        print(app)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


In [26]:
for app in google_data:
    name=app[0]
    if name=='Subway Surfers':
        print(app)

['Subway Surfers', 'GAME', '4.5', '27722264', '76M', '1,000,000,000+', 'Free', '0', 'Everyone 10+', 'Arcade', 'July 12, 2018', '1.90.0', '4.1 and up']
['Subway Surfers', 'GAME', '4.5', '27723193', '76M', '1,000,000,000+', 'Free', '0', 'Everyone 10+', 'Arcade', 'July 12, 2018', '1.90.0', '4.1 and up']
['Subway Surfers', 'GAME', '4.5', '27724094', '76M', '1,000,000,000+', 'Free', '0', 'Everyone 10+', 'Arcade', 'July 12, 2018', '1.90.0', '4.1 and up']
['Subway Surfers', 'GAME', '4.5', '27725352', '76M', '1,000,000,000+', 'Free', '0', 'Everyone 10+', 'Arcade', 'July 12, 2018', '1.90.0', '4.1 and up']
['Subway Surfers', 'GAME', '4.5', '27725352', '76M', '1,000,000,000+', 'Free', '0', 'Everyone 10+', 'Arcade', 'July 12, 2018', '1.90.0', '4.1 and up']
['Subway Surfers', 'GAME', '4.5', '27711703', '76M', '1,000,000,000+', 'Free', '0', 'Everyone 10+', 'Arcade', 'July 12, 2018', '1.90.0', '4.1 and up']


The loop below detects the duplicate entries.

In [39]:
duplicate_apps =[]
unique_apps=[]
for app in google_data[1:]:
    name=app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)
print('Number of duplicate apps: ', len(duplicate_apps))
print('\n')
print('Examples of duplicate apps: ', duplicate_apps[:15])
print('\n')
print('Number of unique apps: ', len(unique_apps)) 

Number of duplicate apps:  1181


Examples of duplicate apps:  ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software']


Number of unique apps:  9659


So we found that there are **9659 unique apps**, and a total of 1181 duplicate rows for those apps (10840 app rows minus 9659 unique names).
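The same two counts can be obtained with `collections.Counter`, which avoids the repeated list lookups (`name in unique_apps`) of the loop above. This is an alternative sketch, not the notebook's code; `count_duplicates` is a hypothetical name and the sample rows are made up.

```python
from collections import Counter

def count_duplicates(rows):
    """Return (number of unique names, number of extra duplicate rows)."""
    counts = Counter(row[0] for row in rows)
    n_unique = len(counts)
    n_duplicates = sum(counts.values()) - n_unique
    return n_unique, n_duplicates

# Made-up sample: 'Box' appears three times, so two rows are duplicates.
sample = [['Box'], ['Slack'], ['Box'], ['Box'], ['Zoom']]
print(count_duplicates(sample))  # (3, 2)
```

With the notebook's data this would be called as `count_duplicates(google_data[1:])`, skipping the header row.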

We will remove the duplicate rows, keeping only the most recent entry for each app. To identify it, we check the **'Reviews'** column: review counts grow over time, so the row with the highest number of reviews should be the most recent one, and that is the row we keep for each app.

**Removing Duplicate Apps**

Below we build a dictionary that maps each unique app name to its maximum number of reviews.

In [43]:
reviews_max={} #name and max reviews
for app in google_data[1:]:
    name=app[0]
    n_reviews=float(app[3])
    if name in reviews_max and n_reviews>reviews_max[name]:
        reviews_max[name]=n_reviews
    elif name not in reviews_max:
        reviews_max.update({name:n_reviews})
print(len(reviews_max))


9659


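Now we will remove the unwanted duplicates. To do this, we create a new dataset (a list of lists) named android_clean that stores only the wanted rows. The *already_added* list makes sure an app is added only once, even when two of its rows share the same maximum number of reviews.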

In [47]:
android_clean=[]
already_added=[]
for app in google_data[1:]:
    name=app[0]
    n_reviews=float(app[3])
    if n_reviews == reviews_max[name] and name not in already_added:
        android_clean.append(app)
        already_added.append(name)
explore_data(android_clean,0,2,True)        


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 9659
Number of columns: 13
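To see why both conditions in the `if` statement are needed, the two steps can be run on a toy data set. The rows below are made up, and `dedupe_by_max_reviews` is a hypothetical wrapper around the same logic; the column positions (name at 0, reviews at 3) follow the Google Play layout.

```python
def dedupe_by_max_reviews(rows, name_idx=0, reviews_idx=3):
    """Keep one row per app name: the first row with the highest review count."""
    # Step 1: highest review count seen for each name.
    reviews_max = {}
    for row in rows:
        name, n_reviews = row[name_idx], float(row[reviews_idx])
        if name not in reviews_max or n_reviews > reviews_max[name]:
            reviews_max[name] = n_reviews

    # Step 2: keep the first row that reaches that maximum.
    clean, already_added = [], set()
    for row in rows:
        name, n_reviews = row[name_idx], float(row[reviews_idx])
        # The second check drops exact ties, such as two rows with 120 reviews.
        if n_reviews == reviews_max[name] and name not in already_added:
            clean.append(row)
            already_added.add(name)
    return clean

sample = [
    ['A', 'GAME', '4.5', '100'],
    ['A', 'GAME', '4.5', '120'],
    ['A', 'GAME', '4.5', '120'],
    ['B', 'TOOLS', '4.0', '7'],
]
print(dedupe_by_max_reviews(sample))
# [['A', 'GAME', '4.5', '120'], ['B', 'TOOLS', '4.0', '7']]
```

Without the `already_added` check, both rows of app `A` with 120 reviews would be kept.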


**Removing the apps that are not for English speakers**

We decided to filter the apps according to their names. We will remove the apps whose names contain more than 3 non-ASCII characters (code points above 127). Although this is not a perfect filter, it should be fairly effective.

The function below, named *lang_check*, is defined for this.

In [106]:
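def lang_check(string):
    # Count the characters outside the ASCII range (code point above 127).
    countt = 0
    for character in string:    # e.g. çağrı   爱奇艺PPS -《欢乐颂2》电视剧热播
        if ord(character) > 127:
            countt += 1
    # More than 3 such characters: treat the name as non-English.
    return countt <= 3
lang_check('hello')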


True
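A few sample names show how the threshold behaves. The sketch below restates the rule as a self-contained function so it can be run on its own; `is_english_name` is a hypothetical name and the sample strings are illustrative.

```python
def is_english_name(name, max_non_ascii=3):
    """Return True if the name has at most `max_non_ascii` non-ASCII characters."""
    non_ascii = sum(1 for ch in name if ord(ch) > 127)
    return non_ascii <= max_non_ascii

print(is_english_name('Instagram'))                      # True
print(is_english_name('Docs To Go™ Free Office Suite'))  # True (one non-ASCII character)
print(is_english_name('Instachat 😜'))                    # True (one non-ASCII character)
print(is_english_name('爱奇艺PPS -《欢乐颂2》电视剧热播'))     # False
print(is_english_name('çağrı'))                          # True (exactly three)
```

The last case illustrates the filter's main limitation: short non-English names with three or fewer non-ASCII characters still pass.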

Below we filter both data sets with *lang_check*, keeping only the apps that pass the check.

In [110]:
android_checked=[]
apple_checked=[]

for app in android_clean:
    if lang_check(app[0])==True:
        android_checked.append(app)
        
for app in apple_data[1:]:
    if lang_check(app[0])==True:
        apple_checked.append(app)       


explore_data(android_checked, 0,2,True)
print('\n')
explore_data(apple_checked, 0,2,True)
       

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 9614
Number of columns: 13


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


Number of rows: 7197
Number of columns: 16


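After removing the non-English apps, 9614 unique apps remain in the Google dataset. The Apple count is still 7197, the full number of data rows: the loop above passes `app[0]`, which is the `id` column in the Apple dataset, so no Apple app is filtered out. Checking `app[1]` (`track_name`) instead would apply the filter to the app names.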
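A sketch of the filter with the name column passed explicitly, so that each store is checked on its name field (`App`, index 0, for Google Play; `track_name`, index 1, for the App Store, per the headers printed earlier). `count_non_ascii` and `filter_english` are hypothetical names, and the sample rows are abbreviated and made up.

```python
def count_non_ascii(text):
    """Number of characters with a code point above 127."""
    return sum(1 for ch in text if ord(ch) > 127)

def filter_english(rows, name_idx, max_non_ascii=3):
    """Keep rows whose name column has at most `max_non_ascii` non-ASCII characters."""
    return [row for row in rows if count_non_ascii(row[name_idx]) <= max_non_ascii]

# Abbreviated, made-up rows in each store's column order.
google_rows = [['Instagram', 'SOCIAL'], ['中国語 AQリスニング', 'FAMILY']]
apple_rows = [['284882215', 'Facebook'], ['000000000', '爱奇艺PPS -《欢乐颂2》电视剧热播']]

print(len(filter_english(google_rows, name_idx=0)))  # 1
print(len(filter_english(apple_rows, name_idx=1)))   # 1
```

With the notebook's lists the calls would be `filter_english(android_clean, 0)` and `filter_english(apple_data[1:], 1)`.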

Next we will remove the apps which are not free from our lists. 

In [121]:
android=[]
apple=[]

for app in android_checked:
    price =float(app[7])
    if price==0:
        android.append(app)
        
for app in apple_checked:
    price=float(app[4])
    if price==0:
        apple.append(app)
        
explore_data(android, 0,2,True)
print('\n')
explore_data(apple, 0,2,True)
    

ValueError: could not convert string to float: '$4.99'
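
The `ValueError` shows that paid apps in the Google Play `Price` column are stored as strings with a dollar sign (for example `'$4.99'`), which `float()` cannot parse. One possible fix, sketched below, is to strip the symbol before converting. `parse_price` and `keep_free` are hypothetical helpers, and the sketch assumes prices contain no other non-numeric characters.

```python
def parse_price(text):
    """Convert a price string such as '0', '0.0' or '$4.99' to a float."""
    return float(text.replace('$', '').strip())

def keep_free(rows, price_idx):
    """Return only the rows whose price parses to zero."""
    return [row for row in rows if parse_price(row[price_idx]) == 0.0]

# Made-up rows; only the price column matters here.
sample = [['Free app', '0'], ['Paid app', '$4.99'], ['Also free', '0.0']]
print(keep_free(sample, price_idx=1))
# [['Free app', '0'], ['Also free', '0.0']]
```

With the notebook's lists this would be called as `keep_free(android_checked, 7)` and `keep_free(apple_checked, 4)`, matching the price columns used in the cell above.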