# 1. Define the Goal

Mobile apps are everywhere. They are easy to create and can be lucrative, so more and more of them are being developed. In this project, you will do a comprehensive analysis of the Android app market by comparing over ten thousand apps in Google Play across different categories. You'll look for insights in the data to devise strategies to drive growth and retention. The suggested steps for achieving the project's objectives are:


1. Define the goal
2. Get the data
3. Clean the data
4. Enrich the data
5. Find insights and Visualize
6. Iterate
7. Report
8. Conclusion


# 2. Get the Data

## Import the libraries

In [1]:
import matplotlib.pyplot as plt # plotting
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import math
import random
import os
import time



In [2]:
data = pd.read_csv("C:/Users/bariu/Desktop/Jenga-Project/Android apps.csv")
print(data.shape)

(50217, 9)


In [3]:
data.head()

| | basename | category | company | age_rating | Downloads | Category2 | price | rating | numberreviews |
|---|---|---|---|---|---|---|---|---|---|
| 0 | netflix | entertainment | netflix, inc. | no info | 500000000.0 | entertainment | free | 4.5 | 7287852 |
| 1 | facebook | communication | facebook | everyone | 1000000000.0 | communication | free | 4.2 | 69050158 |
| 2 | android | communication | google llc | everyone | 5000000.0 | internet browser | free | 4.3 | 17065648 |
| 3 | google | communication | google llc | everyone | 5000000.0 | mail | free | 4.4 | 6272191 |
| 4 | grindrapp | social | grindr llc | 18+ | 10000000.0 | dating | free | 3.5 | 365432 |


In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50217 entries, 0 to 50216
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   basename       50217 non-null  object 
 1   category       50215 non-null  object 
 2   company        50215 non-null  object 
 3   age_rating     50217 non-null  object 
 4   Downloads      40816 non-null  float64
 5   Category2      50217 non-null  object 
 6   price          49899 non-null  object 
 7   rating         49899 non-null  object 
 8   numberreviews  49899 non-null  object 
dtypes: float64(1), object(8)
memory usage: 1.9+ MB


# 3. Clean the Data

Let's look at the columns.


In [5]:
# missing data: count and fraction of missing values per column
total = data.isnull().sum().sort_values(ascending=False)
percent = (data.isnull().sum() / data.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=["Total", "percent"])
missing_data.head(6)


| | Total | percent |
|---|---|---|
| Downloads | 9401 | 0.187208 |
| numberreviews | 318 | 0.006333 |
| rating | 318 | 0.006333 |
| price | 318 | 0.006333 |
| company | 2 | 4e-05 |
| category | 2 | 4e-05 |
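
The same kind of summary can also be built from `isna().sum()` and `isna().mean()`, since the mean of a boolean mask is the fraction of missing cells. The following is a minimal sketch on a made-up toy frame (`toy` and its values are illustrative assumptions, not rows from the app dataset):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the app data (values are invented for illustration)
toy = pd.DataFrame({
    "basename": ["a", "b", "c", "d"],
    "Downloads": [1000.0, np.nan, np.nan, 5000.0],
    "price": ["free", "free", None, "free"],
})

# isna().sum() counts missing cells per column;
# isna().mean() gives the fraction of missing cells per column
missing_summary = pd.DataFrame({
    "Total": toy.isna().sum(),
    "percent": toy.isna().mean(),
}).sort_values("Total", ascending=False)

print(missing_summary)
```

Note that the `percent` column holds a fraction between 0 and 1, so it would need to be multiplied by 100 to read as a percentage.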


In [6]:
# convert "error during scraping" values to NaN values
data = data.replace(['error during scraping'],'NaN')
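
One point worth keeping in mind with this kind of replacement: pandas treats the string `'NaN'` as ordinary text, so `dropna()` does not remove rows that contain it, whereas `np.nan` is recognised as a missing value. The sketch below illustrates the difference on a made-up toy frame (`toy` and its values are assumptions for illustration only):

```python
import numpy as np
import pandas as pd

# Toy frame with one placeholder value (invented for illustration)
toy = pd.DataFrame({
    "basename": ["a", "b", "c"],
    "rating": ["4.5", "error during scraping", "3.9"],
})

# Replacing with the string 'NaN' leaves an ordinary string in the cell,
# so dropna() keeps every row
as_string = toy.replace(["error during scraping"], "NaN")
rows_after_string = len(as_string.dropna())

# Replacing with np.nan creates a real missing value,
# so dropna() removes the affected row
as_missing = toy.replace(["error during scraping"], np.nan)
rows_after_missing = len(as_missing.dropna())

print(rows_after_string, rows_after_missing)
```

Which behaviour is wanted depends on whether rows with scraping errors should survive the `dropna()` step that follows.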

In [7]:
# drop rows with NaN values
data = data.dropna()

In [8]:
print(data.info())
print(data.describe())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 40812 entries, 0 to 50216
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   basename       40812 non-null  object 
 1   category       40812 non-null  object 
 2   company        40812 non-null  object 
 3   age_rating     40812 non-null  object 
 4   Downloads      40812 non-null  float64
 5   Category2      40812 non-null  object 
 6   price          40812 non-null  object 
 7   rating         40812 non-null  object 
 8   numberreviews  40812 non-null  object 
dtypes: float64(1), object(8)
memory usage: 1.9+ MB
None
          Downloads
count  4.081200e+04
mean   4.006455e+06
std    3.638507e+07
min    0.000000e+00
25%    1.000000e+04
50%    1.000000e+05
75%    1.000000e+06
max    1.000000e+09


In [9]:
# confirm that no missing values remain
data.isnull().sum()

basename         0
category         0
company          0
age_rating       0
Downloads        0
Category2        0
price            0
rating           0
numberreviews    0
dtype: int64

In [10]:
# check for duplicates in the whole dataset
duplicate = data[data.duplicated()]
print("Duplicate Rows :")
# print the resulting DataFrame
duplicate

Duplicate Rows :


| | basename | category | company | age_rating | Downloads | Category2 | price | rating | numberreviews |
|---|---|---|---|---|---|---|---|---|---|
| 1281 | runtastic | health & fitness | runtastic | everyone | 100000.0 | health & fitness | | | |
| 1309 | runtastic | health & fitness | runtastic | everyone | 100000.0 | health & fitness | | | |
| 1387 | vrt | entertainment | vrt | no info | 1000.0 | entertainment | | | |
| 2675 | runtastic | health & fitness | runtastic | everyone | 1000000.0 | health & fitness | | | |
| 2676 | runtastic | health & fitness | runtastic | everyone | 1000000.0 | health & fitness | | | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 50211 | hu | education | hogeschool utrecht | everyone | 10000.0 | education | free | 1.6 | 48 |
| 50213 | thalys | news & magazines | thalys | everyone | 1000.0 | news & magazines | free | 2.2 | 26 |
| 50214 | dreame | books & reference | dreame media | 16+ | 1000000.0 | books & reference | free | 4.5 | 10331 |
| 50215 | thalys | news & magazines | thalys | everyone | 1000.0 | news & magazines | free | 2.2 | 26 |


### Column: basename

In [11]:
# column: basename
# check for duplicates
duplicate1 = data[data.duplicated('basename')]
print("Duplicate Rows :")
# print the resulting DataFrame
duplicate1

Duplicate Rows :


| | basename | category | company | age_rating | Downloads | Category2 | price | rating | numberreviews |
|---|---|---|---|---|---|---|---|---|---|
| 6 | google | video players & editors | google llc | no info | 5.000000e+06 | video players & editors | free | 4.4 | 57127897 |
| 7 | facebook | social | facebook | no info | 1.000000e+09 | social | free | 4.2 | 91754952 |
| 10 | google | travel & local | google llc | everyone | 5.000000e+06 | travel & local | free | 4.3 | 10678723 |
| 12 | facebook | business | facebook | everyone | 5.000000e+07 | business | free | 4.1 | 1466919 |
| 24 | google | tools | google llc | everyone | 5.000000e+06 | tools | free | 4.2 | 25684314 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 50211 | hu | education | hogeschool utrecht | everyone | 1.000000e+04 | education | free | 1.6 | 48 |
| 50213 | thalys | news & magazines | thalys | everyone | 1.000000e+03 | news & magazines | free | 2.2 | 26 |
| 50214 | dreame | books & reference | dreame media | 16+ | 1.000000e+06 | books & reference | free | 4.5 | 10331 |
| 50215 | thalys | news & magazines | thalys | everyone | 1.000000e+03 | news & magazines | free | 2.2 | 26 |


In [12]:
# remove duplicates: drop every row flagged above
# (keep=False also drops the first copy of exact duplicate rows)
data = pd.concat([data, duplicate, duplicate1]).drop_duplicates(keep=False)
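
To make the effect of the `concat` plus `drop_duplicates(keep=False)` pattern easier to see, here is a small sketch on a made-up toy frame, next to the more common `drop_duplicates(subset=..., keep="first")` call. The frame `toy` and the names `strict` and `first_only` are illustrative assumptions; the alternative is shown for comparison only and would give a different row count than the cell above.

```python
import pandas as pd

# Toy frame (invented for illustration): "google" repeats with different
# categories, "thalys" appears as two identical rows
toy = pd.DataFrame({
    "basename": ["google", "google", "thalys", "thalys", "hu"],
    "category": ["communication", "tools", "news", "news", "education"],
})

# Pattern from the cell above: append the flagged rows, then drop every row
# whose values occur more than once. Later repeats of a basename disappear,
# and exact duplicate rows disappear entirely, including their first copy.
dup_rows = toy[toy.duplicated()]
dup_names = toy[toy.duplicated("basename")]
strict = pd.concat([toy, dup_rows, dup_names]).drop_duplicates(keep=False)

# Alternative: keep the first row seen for each basename
first_only = toy.drop_duplicates(subset="basename", keep="first")

print(strict["basename"].tolist())
print(first_only["basename"].tolist())
```

In this toy example the first pattern loses `thalys` altogether, while the alternative keeps one row for it. Which outcome is preferable depends on whether an app with repeated rows should stay in the analysis.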

### Column: rating

In [13]:
# column: rating
data["rating"]

2                    4.3
3                    4.4
4                    3.5
5                    4.0
9                    4.6
              ...       
50052    rating disabled
50053    rating disabled
50061                4.4
50064    rating disabled
50065    rating disabled
Name: rating, Length: 26580, dtype: object

In [14]:
# convert "rating disabled" to NaN (a real missing value, not the string 'NaN')
data["rating"] = data["rating"].replace(['rating disabled'], np.nan)

In [15]:
# convert rating to float
data["rating"] = data.rating.astype(float)
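
The replace-then-cast steps can also be done in a single call with `pd.to_numeric(..., errors="coerce")`, which turns any value that cannot be parsed as a number into NaN. The following is a minimal sketch on a made-up series (`ratings` and its values are assumptions for illustration, not taken from the dataset):

```python
import pandas as pd

# Toy ratings column mixing numeric strings and a text placeholder
ratings = pd.Series(["4.3", "rating disabled", "3.5", "4.4"], name="rating")

# errors="coerce" converts unparseable entries to NaN instead of raising
numeric = pd.to_numeric(ratings, errors="coerce")

print(numeric.dtype)
print(numeric.isna().sum())
```

One trade-off of this approach is that it silently coerces every non-numeric value, not only `"rating disabled"`, so it may hide unexpected placeholders that an explicit replacement would leave visible.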

### Column: price

In [None]:
data["price"]

In [None]:
data.info()