### Goal
Our aim is to help our developers understand which types of apps are likely to attract more users on Google Play and the App Store.

### About Dataset

**googleplaystore.csv** - a dataset of approximately 10,000 Android apps from Google Play, collected in August 2018.

**AppleStore.csv** - a dataset of approximately 7,000 iOS apps from the App Store, collected in July 2017.


### Loading Data

In [2]:
#Loading the data
import pandas as pd
#Google playstore dataset
android = pd.read_csv('Data/googleplaystore.csv')


#Apple store dataset
apple = pd.read_csv('Data/AppleStore.csv')


### Data Exploration

#### Columns in the dataset

In [3]:
android.columns.values

array(['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type',
       'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver',
       'Android Ver'], dtype=object)

In [4]:
#Using a loop to print each column name
for cols in android.columns:
    print(cols)

App
Category
Rating
Reviews
Size
Installs
Type
Price
Content Rating
Genres
Last Updated
Current Ver
Android Ver


In [5]:
apple.columns.values

array(['id', 'track_name', 'size_bytes', 'currency', 'price',
       'rating_count_tot', 'rating_count_ver', 'user_rating',
       'user_rating_ver', 'ver', 'cont_rating', 'prime_genre',
       'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic'],
      dtype=object)

In [6]:
for cols in apple.columns:
    print(cols)

id
track_name
size_bytes
currency
price
rating_count_tot
rating_count_ver
user_rating
user_rating_ver
ver
cont_rating
prime_genre
sup_devices.num
ipadSc_urls.num
lang.num
vpp_lic


#### Shape of the datasets

In [7]:
android.shape

(10841, 13)

In [8]:
apple.shape

(7197, 16)

#### Dataset information

In [9]:
android.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10841 entries, 0 to 10840
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   App             10841 non-null  object 
 1   Category        10841 non-null  object 
 2   Rating          9367 non-null   float64
 3   Reviews         10841 non-null  object 
 4   Size            10841 non-null  object 
 5   Installs        10841 non-null  object 
 6   Type            10840 non-null  object 
 7   Price           10841 non-null  object 
 8   Content Rating  10840 non-null  object 
 9   Genres          10841 non-null  object 
 10  Last Updated    10841 non-null  object 
 11  Current Ver     10833 non-null  object 
 12  Android Ver     10838 non-null  object 
dtypes: float64(1), object(12)
memory usage: 1.1+ MB


In [10]:
apple.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7197 entries, 0 to 7196
Data columns (total 16 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   id                7197 non-null   int64  
 1   track_name        7197 non-null   object 
 2   size_bytes        7197 non-null   int64  
 3   currency          7197 non-null   object 
 4   price             7197 non-null   float64
 5   rating_count_tot  7197 non-null   int64  
 6   rating_count_ver  7197 non-null   int64  
 7   user_rating       7197 non-null   float64
 8   user_rating_ver   7197 non-null   float64
 9   ver               7197 non-null   object 
 10  cont_rating       7197 non-null   object 
 11  prime_genre       7197 non-null   object 
 12  sup_devices.num   7197 non-null   int64  
 13  ipadSc_urls.num   7197 non-null   int64  
 14  lang.num          7197 non-null   int64  
 15  vpp_lic           7197 non-null   int64  
dtypes: float64(3), int64(8), object(5)
memory 

In [11]:
android.head()

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,"100,000+",Free,0,Everyone,Art & Design;Creativity,"June 20, 2018",1.1,4.4 and up


In [12]:
apple.head()

Unnamed: 0,id,track_name,size_bytes,currency,price,rating_count_tot,rating_count_ver,user_rating,user_rating_ver,ver,cont_rating,prime_genre,sup_devices.num,ipadSc_urls.num,lang.num,vpp_lic
0,284882215,Facebook,389879808,USD,0.0,2974676,212,3.5,3.5,95.0,4+,Social Networking,37,1,29,1
1,389801252,Instagram,113954816,USD,0.0,2161558,1289,4.5,4.0,10.23,12+,Photo & Video,37,0,29,1
2,529479190,Clash of Clans,116476928,USD,0.0,2130805,579,4.5,4.5,9.24.12,9+,Games,38,5,18,1
3,420009108,Temple Run,65921024,USD,0.0,1724546,3842,4.5,4.0,1.6.2,9+,Games,40,5,1,1
4,284035177,Pandora - Music & Radio,130242560,USD,0.0,1126879,3594,4.0,4.5,8.4.1,12+,Music,37,4,1,1


#### loc and iloc to retrieve data
**loc** selects by label<br>
**iloc** selects by integer position
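
The difference is easiest to see when the index labels are not 0, 1, 2, and so on. The following is a minimal sketch on a made-up frame (the `app` and `rating` columns and the index values are assumptions for illustration, not part of either dataset):

```python
import pandas as pd

# Toy frame with a non-default integer index so labels and positions differ
df = pd.DataFrame(
    {"app": ["A", "B", "C", "D"], "rating": [4.1, 3.9, 4.7, 4.5]},
    index=[10, 20, 30, 40],
)

by_label = df.loc[20:30]    # label-based slice: both endpoints are included
by_position = df.iloc[1:3]  # position-based slice: the end position is excluded

print(by_label)
print(by_position)
print(df.loc[10, "rating"])  # single value by row label and column name
print(df.iloc[0, 1])         # the same value by row and column position
```

With the default `RangeIndex` used by `android` and `apple`, labels and positions coincide, which is why `apple.loc[2]` and `apple.iloc[2]` return the same row there.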

In [13]:
#Display the rows at positions 11 to 13 (the end position, 14, is excluded)
android.iloc[11:14]

# Alternatively, android.iloc[11:14, :]


Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
11,Name Art Photo Editor - Focus n Filters,ART_AND_DESIGN,4.4,8788,12M,"1,000,000+",Free,0,Everyone,Art & Design,"July 31, 2018",1.0.15,4.0 and up
12,Tattoo Name On My Photo Editor,ART_AND_DESIGN,4.2,44829,20M,"10,000,000+",Free,0,Teen,Art & Design,"April 2, 2018",3.8,4.1 and up
13,Mandala Coloring Book,ART_AND_DESIGN,4.6,4326,21M,"100,000+",Free,0,Everyone,Art & Design,"June 26, 2018",1.0.4,4.4 and up


In [14]:
#Selecting a range of rows and columns simultaneously
android.iloc[11:14, 2:5] 

Unnamed: 0,Rating,Reviews,Size
11,4.4,8788,12M
12,4.2,44829,20M
13,4.6,4326,21M


In [15]:
android.loc[(android.Rating == 4.1)]

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
7,Infinite Painter,ART_AND_DESIGN,4.1,36815,29M,"1,000,000+",Free,0,Everyone,Art & Design,"June 14, 2018",6.1.61.1,4.2 and up
21,Boys Photo Editor - Six Pack & Men's Suit,ART_AND_DESIGN,4.1,654,12M,"100,000+",Free,0,Everyone,Art & Design,"March 20, 2018",1.1,4.0.3 and up
27,Animated Photo Editor,ART_AND_DESIGN,4.1,203,6.1M,"100,000+",Free,0,Everyone,Art & Design,"March 21, 2018",1.03,4.0.3 and up
29,Easy Realistic Drawing Tutorial,ART_AND_DESIGN,4.1,223,4.2M,"100,000+",Free,0,Everyone,Art & Design,"August 22, 2017",1.0,2.3 and up
...,...,...,...,...,...,...,...,...,...,...,...,...,...
10780,Modern Counter 3: FPS Multiplayers battlegro 3,FAMILY,4.1,17,50M,"1,000+",Free,0,Everyone,Strategy,"March 16, 2018",15,4.1 and up
10787,Modern Counter Global Strike 3D,GAME,4.1,297,48M,"50,000+",Free,0,Teen,Action,"March 28, 2018",1.2,4.1 and up
10790,HipChat - beta version,COMMUNICATION,4.1,1035,20M,"50,000+",Free,0,Everyone,Communication,"August 7, 2018",3.20.001,4.1 and up
10800,FR Roster,TOOLS,4.1,174,12M,"5,000+",Free,0,Everyone,Tools,"July 30, 2018",6.04,4.4 and up


In [16]:
android.loc[(android['Category']== 'ART_AND_DESIGN') & (android['Size'] == '12M')]

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
11,Name Art Photo Editor - Focus n Filters,ART_AND_DESIGN,4.4,8788,12M,"1,000,000+",Free,0,Everyone,Art & Design,"July 31, 2018",1.0.15,4.0 and up
21,Boys Photo Editor - Six Pack & Men's Suit,ART_AND_DESIGN,4.1,654,12M,"100,000+",Free,0,Everyone,Art & Design,"March 20, 2018",1.1,4.0.3 and up
44,Popsicle Sticks and Similar DIY Craft Ideas,ART_AND_DESIGN,4.2,26,12M,"10,000+",Free,0,Everyone,Art & Design,"January 3, 2018",1.0.0,4.1 and up


##### Apple Dataset 

In [17]:
apple.iloc[10:15, 6:9]

Unnamed: 0,rating_count_ver,user_rating,user_rating_ver
10,97,4.5,4.0
11,132,4.5,4.0
12,9673,4.5,4.5
13,2029,4.5,4.5
14,1087,4.5,4.5


In [18]:
apple.loc[apple.user_rating == 3.5]

Unnamed: 0,id,track_name,size_bytes,currency,price,rating_count_tot,rating_count_ver,user_rating,user_rating_ver,ver,cont_rating,prime_genre,sup_devices.num,ipadSc_urls.num,lang.num,vpp_lic
0,284882215,Facebook,389879808,USD,0.00,2974676,212,3.5,3.5,95.0,4+,Social Networking,37,1,29,1
22,295646461,"The Weather Channel: Forecast, Radar & Alerts",199734272,USD,0.00,495626,5893,3.5,4.5,8.11,4+,Weather,37,0,33,1
24,284815942,Google – Search made just for mobile,179979264,USD,0.00,479440,203,3.5,4.0,27.0,17+,Utilities,37,4,33,1
27,293622097,Google Earth,37214208,USD,0.00,446185,1359,3.5,3.5,7.1.6,4+,Travel,43,5,30,1
43,304878510,Skype for iPhone,133238784,USD,0.00,373519,127,3.5,4.0,6.35.1,4+,Social Networking,37,0,32,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6104,1119424086,Dumb Ways JR Madcap's Plane,110077952,USD,1.99,3,3,3.5,3.5,1.0,4+,Education,37,5,1,1
6105,1169297765,Fantasy Princess - Girls Makeup & Dress Up Games,97917952,USD,1.99,3,3,3.5,3.5,9999.1.1,4+,Games,40,5,1,1
6118,1015820762,Circle Swing,8642560,USD,0.00,2,2,3.5,3.5,1.2,4+,Games,40,3,1,1
6123,646504584,Ruler and Compass Geometry,24403968,USD,0.99,2,2,3.5,3.5,1.1,4+,Education,26,4,16,1


In [19]:
#View the row labeled 2 (the third row, since labels start at 0)
apple.loc[2]

id                       529479190
track_name          Clash of Clans
size_bytes               116476928
currency                       USD
price                          0.0
rating_count_tot           2130805
rating_count_ver               579
user_rating                    4.5
user_rating_ver                4.5
ver                        9.24.12
cont_rating                     9+
prime_genre                  Games
sup_devices.num                 38
ipadSc_urls.num                  5
lang.num                        18
vpp_lic                          1
Name: 2, dtype: object

### Data Cleaning

We need to make sure the data we analyze is accurate; otherwise the results of our analysis will be wrong. This means we need to do the following:<br>

Detect inaccurate data, and correct or remove it.<br>
Detect duplicate data, and remove the duplicates.<br><br>


At our company, we only build apps that are free to download and install, and we design them for an English-speaking audience. This means we also need to do the following:<br>

Remove non-English apps like 爱奇艺PPS -《欢乐颂2》电视剧热播.<br>
Remove apps that aren't free.<br>

#### Removing Non-English apps



In [20]:
#Requirement: pip install langdetect
from langdetect import detect  # Import the detect function from the langdetect library

In [21]:
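# Check whether a text is English by testing that it contains only ASCII characters.
# Caveat: this rejects English titles that contain any non-ASCII character (e.g. an en dash or an emoji).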
def is_english(text):
    try:
        text.encode(encoding='utf-8').decode('ascii')
        return True
    except UnicodeDecodeError:
        return False
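
A strict ASCII test rejects a title such as "U Launcher Lite – FREE Live Cool Themes" because of a single en dash. One possible refinement, sketched below, tolerates a small number of non-ASCII characters. The helper name `is_mostly_ascii` and the threshold of 3 are assumptions chosen for illustration, not values taken from the analysis above:

```python
def is_mostly_ascii(text, max_non_ascii=3):
    """Return True if `text` contains at most `max_non_ascii` non-ASCII characters.

    This is a heuristic, not a language detector: it only counts characters
    whose code point falls outside the ASCII range (0-127).
    """
    non_ascii = sum(1 for ch in text if ord(ch) > 127)
    return non_ascii <= max_non_ascii


print(is_mostly_ascii("Instagram"))
print(is_mostly_ascii("U Launcher Lite – FREE Live Cool Themes"))
print(is_mostly_ascii("爱奇艺PPS -《欢乐颂2》电视剧热播"))
```

The trade-off is that a short non-English title with three or fewer non-ASCII characters would still pass, so the threshold is worth tuning against a sample of the data.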

In [22]:
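#Prerequisite: pip install textblob
from textblob import TextBlob

# Note: this redefines is_english() from the previous cell.
# detect_language() is deprecated in TextBlob and has been removed from recent releases;
# if the call fails for any reason, the except clause below returns False.
def is_english(text):
    try:
        blob = TextBlob(text)
        return blob.detect_language() == 'en'
    except Exception:
        return False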

In [23]:
#Android dataset
# Create a new column that indicates whether each app name is in English
android['is_english'] = android['App'].astype(str).apply(is_english)

# Split into English and non-English apps using the boolean column
android_english_df = android[android['is_english']]
android_non_english_df = android[~android['is_english']]

# Display the first rows flagged as non-English
android_non_english_df.head()

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver,is_english
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up,False
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up,False
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up,False
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up,False
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,"100,000+",Free,0,Everyone,Art & Design;Creativity,"June 20, 2018",1.1,4.4 and up,False


In [24]:
android_english_df.head()

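Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver,is_english

The English subset is empty, and every row in the previous output is flagged `False`, including clearly English titles. This indicates that the language-detection call fails on every input and the `except` branch returns `False`, so this filter should not be relied on as-is.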

In [25]:
#iOS dataset
# Create a new column that indicates whether each app name is in English
apple['is_english'] = apple['track_name'].astype(str).apply(is_english)

# Split into English and non-English apps using the boolean column of the apple frame
apple_english_df = apple[apple['is_english']]
apple_non_english_df = apple[~apple['is_english']]

# Display the first rows flagged as non-English
apple_non_english_df.head()

In [None]:
#Function to check if app is an English app
#Prerequisite: pip install langid
import langid

#This function uses the langid package to identify the language of each string in the input column.
#It returns a list of booleans: True where the text is classified as English.

def check_english(dataframe, column):
    is_english_flags = []
    for text in dataframe[column]:
        try:
            lang = langid.classify(str(text))[0]
        except Exception:
            # If classification fails, treat the text as not English
            lang = ''
        is_english_flags.append(lang == 'en')
    return is_english_flags

The classify() function of langid returns a tuple (language code, confidence score) for a given text.<br>
We only need the language code here, which is a two-letter code representing the language (e.g., 'en' for English, 'fr' for French).<br>
If classify() raises an exception, we treat the string as not English.

In [None]:
#Counting the unique values in the Type column (nunique() excludes NaN by default)
android['Type'].nunique()

3

In [None]:
#Listing the unique values in the Type column (unique() includes NaN, hence four entries)
android['Type'].unique()

array(['Free', 'Paid', nan, '0'], dtype=object)
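
The count of 3 and the four-element array differ because `nunique()` drops missing values by default while `unique()` keeps them. A small sketch on a made-up Series (the values mirror the ones listed above, but the Series itself is an assumption for illustration):

```python
import numpy as np
import pandas as pd

# Toy column with the same kinds of values as the Type column
types = pd.Series(["Free", "Paid", np.nan, "0", "Free"])

n_without_nan = types.nunique()             # NaN is excluded by default
n_with_nan = types.nunique(dropna=False)    # NaN counted as its own value

print(n_without_nan, n_with_nan)
print(types.value_counts(dropna=False))     # frequency of each value, NaN included
```

`value_counts(dropna=False)` is often the more useful call here, because it also shows how rare the unexpected `'0'` entry is.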

In [None]:
android.loc[ (android['Type'] == 'Paid')]

In [None]:
#Isolating free apps
# Google Play stores Price as text ('0' for free apps); the App Store stores price as a float
android_final = android_english_df[android_english_df['Price'] == '0']
ios_final = apple_english_df[apple_english_df['price'] == 0.0]

print(len(android_final))
print(len(ios_final))
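
Because the Google Play `Price` column is stored as text, comparing against the string `'0'` works only as long as free apps are always written exactly that way. A more defensive option is to convert the column to numbers first. The sketch below uses a made-up Series; the `'$4.99'` format and the stray `'Everyone'` value are assumptions for illustration:

```python
import pandas as pd

# Toy price column: free apps, a paid app, and one malformed entry
prices = pd.Series(["0", "$4.99", "0", "Everyone"])

# Strip the currency symbol, then convert; values that cannot be parsed become NaN
numeric = pd.to_numeric(prices.str.replace("$", "", regex=False), errors="coerce")

is_free = numeric == 0  # NaN compares as False, so malformed rows are not treated as free

print(numeric.tolist())
print(is_free.tolist())
```

Rows that end up as NaN are worth inspecting separately, since they usually point to a data-entry problem rather than a pricing category.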

#### Checking for missing values

In [None]:
#Check for missing values
missing = android.isna().sum()
missing

App                  0
Category             0
Rating            1474
Reviews              0
Size                 0
Installs             0
Type                 1
Price                0
Content Rating       1
Genres               0
Last Updated         0
Current Ver          8
Android Ver          3
dtype: int64

In [None]:
#Percentage of missing data
percent_missing = missing/len(android) * 100
percent_missing

App                0.000000
Category           0.000000
Rating            13.596532
Reviews            0.000000
Size               0.000000
Installs           0.000000
Type               0.009224
Price              0.000000
Content Rating     0.009224
Genres             0.000000
Last Updated       0.000000
Current Ver        0.073794
Android Ver        0.027673
dtype: float64
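
The same percentages can be obtained in one step, since the mean of a boolean mask is the fraction of `True` values. A minimal sketch on a made-up frame (the column names echo the Android dataset, but the values are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Toy frame with some missing entries
df = pd.DataFrame({
    "Rating": [4.1, np.nan, 3.9, np.nan],
    "Type": ["Free", "Paid", None, "Free"],
    "App": ["a", "b", "c", "d"],
})

# isna() gives a boolean frame; mean() per column is the share of missing values
percent_missing = df.isna().mean() * 100

# Sort so the columns with the most missing data come first
print(percent_missing.sort_values(ascending=False))
```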

In [None]:
missing_values = apple.isna().sum()
missing_values

id                  0
track_name          0
size_bytes          0
currency            0
price               0
rating_count_tot    0
rating_count_ver    0
user_rating         0
user_rating_ver     0
ver                 0
cont_rating         0
prime_genre         0
sup_devices.num     0
ipadSc_urls.num     0
lang.num            0
vpp_lic             0
dtype: int64

In [None]:
#Percentage of missing values
missing_values/len(apple) * 100

id                  0.0
track_name          0.0
size_bytes          0.0
currency            0.0
price               0.0
rating_count_tot    0.0
rating_count_ver    0.0
user_rating         0.0
user_rating_ver     0.0
ver                 0.0
cont_rating         0.0
prime_genre         0.0
sup_devices.num     0.0
ipadSc_urls.num     0.0
lang.num            0.0
vpp_lic             0.0
dtype: float64

There are no missing values in the Apple dataset.

#### Checking for duplicates

In [None]:
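def check_for_duplicates(df, column='App'):
    # Calling set() on a DataFrame only collects its column labels,
    # so the check is done on the values of a single column instead.
    # Returns True if any value in the column appears more than once.
    return bool(df[column].duplicated().any())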
    


In [None]:
# Iterating over a DataFrame yields its column labels, so work on the 'App' column directly
duplicate_mask = android['App'].duplicated()
duplicate_apps = android.loc[duplicate_mask, 'App'].tolist()

print('Number of duplicate apps:', len(duplicate_apps))
print('\n')
print('Examples of duplicate apps:', duplicate_apps[:10])
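
Once duplicates are found, they still have to be removed. One common rule, sketched below, is to keep the entry with the most reviews for each app, on the assumption that it is the most recent record. The toy frame and that rule are assumptions for illustration; in the Android dataset, `Reviews` is stored as text and would need converting to a number first:

```python
import pandas as pd

# Toy frame with repeated app names and numeric review counts
apps = pd.DataFrame({
    "App": ["Instagram", "Instagram", "Slack", "Instagram", "Slack"],
    "Reviews": [100, 250, 40, 180, 40],
})

# duplicated() marks every occurrence after the first one
n_duplicates = int(apps["App"].duplicated().sum())

# Keep, for each app, the row with the highest review count
deduped = (
    apps.sort_values("Reviews", ascending=False)
        .drop_duplicates(subset="App", keep="first")
        .sort_index()
)

print("Number of duplicate rows:", n_duplicates)
print(deduped)
```

Dropping duplicates at random would also work, but a deterministic rule makes the cleaning step reproducible.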
