# Google Play Store Analysis
Janiel - May 2024

----------

#### Target 
This project carries out an in-depth Exploratory Data Analysis (EDA) based on the Knowledge Discovery in Databases (KDD) methodology.

#### DataSet 
The dataset is available on [Kaggle](https://www.kaggle.com/datasets/lava18/google-play-store-apps).

#### Inspiration
The insights obtained by analyzing how people interact with the apps available on the Google Play Store are highly relevant to companies that want not only to find features for their new apps but also to direct their efforts and achieve a strategic position.

<br>
<br>
<hr>


### Import Libraries

In [1]:
# Download Dataset 
import opendatasets as od
import os

# Manipulate data
import pandas as pd
import numpy as np

# Visualize data
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline


## 1. Data Selection

### 1.1 Load DataSet

In [2]:
#-- od.download('https://www.kaggle.com/datasets/lava18/google-play-store-apps')

In [3]:
#-- os.listdir('google-play-store-apps')

In [4]:
store_df = pd.read_csv('./Dataset_play-store-apps/googleplaystore.csv')

### 1.2 Overview

In [5]:
store_df.head()
# store_df.tail(10)

| | App | Category | Rating | Reviews | Size | Installs | Type | Price | Content Rating | Genres | Last Updated | Current Ver | Android Ver |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Photo Editor & Candy Camera & Grid & ScrapBook | ART_AND_DESIGN | 4.1 | 159 | 19M | 10,000+ | Free | 0 | Everyone | Art & Design | January 7, 2018 | 1.0.0 | 4.0.3 and up |
| 1 | Coloring book moana | ART_AND_DESIGN | 3.9 | 967 | 14M | 500,000+ | Free | 0 | Everyone | Art & Design;Pretend Play | January 15, 2018 | 2.0.0 | 4.0.3 and up |
| 2 | U Launcher Lite – FREE Live Cool Themes, Hide ... | ART_AND_DESIGN | 4.7 | 87510 | 8.7M | 5,000,000+ | Free | 0 | Everyone | Art & Design | August 1, 2018 | 1.2.4 | 4.0.3 and up |
| 3 | Sketch - Draw & Paint | ART_AND_DESIGN | 4.5 | 215644 | 25M | 50,000,000+ | Free | 0 | Teen | Art & Design | June 8, 2018 | Varies with device | 4.2 and up |
| 4 | Pixel Draw - Number Art Coloring Book | ART_AND_DESIGN | 4.3 | 967 | 2.8M | 100,000+ | Free | 0 | Everyone | Art & Design;Creativity | June 20, 2018 | 1.1 | 4.4 and up |


In [6]:
store_df.shape

(10841, 13)

In [7]:
store_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10841 entries, 0 to 10840
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   App             10841 non-null  object 
 1   Category        10841 non-null  object 
 2   Rating          9367 non-null   float64
 3   Reviews         10841 non-null  object 
 4   Size            10841 non-null  object 
 5   Installs        10841 non-null  object 
 6   Type            10840 non-null  object 
 7   Price           10841 non-null  object 
 8   Content Rating  10840 non-null  object 
 9   Genres          10841 non-null  object 
 10  Last Updated    10841 non-null  object 
 11  Current Ver     10833 non-null  object 
 12  Android Ver     10838 non-null  object 
dtypes: float64(1), object(12)
memory usage: 1.1+ MB


#### 💡 Insights

Note that we face a number of challenges 🧐

1. Several columns hold values in inconsistent formats (Size, for example, is given in both megabytes and kilobytes)
2. Many values are badly formatted
3. There are some null values
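
The challenges listed above can be quantified before any cleaning starts. The sketch below is only an illustration on a toy frame: `toy_df` and its values are made up for the example and stand in for `store_df`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for store_df (illustrative values, not taken from the dataset)
toy_df = pd.DataFrame({
    "App": ["A", "B", "C"],
    "Rating": [4.1, np.nan, 3.9],
    "Size": ["19M", "201k", "Varies with device"],
})

# Count the missing values in each column, largest first
null_counts = toy_df.isna().sum().sort_values(ascending=False)
print(null_counts)

# Tally the last character of each Size entry to see which unit suffixes occur
size_suffix = toy_df["Size"].str[-1].value_counts()
print(size_suffix)
```

Running the same two expressions on `store_df` would show which columns need null handling and how many rows use each size unit.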

<hr>

## 2. Pre-processing

Let's take it one step at a time 😅

### 2.1 Reviews column

In [8]:
# Check whether every value in the "Reviews" column is numeric
store_df[~store_df['Reviews'].str.isnumeric()]

| | App | Category | Rating | Reviews | Size | Installs | Type | Price | Content Rating | Genres | Last Updated | Current Ver | Android Ver |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10472 | Life Made WI-Fi Touchscreen Photo Frame | 1.9 | 19.0 | 3.0M | 1,000+ | Free | 0 | Everyone | | February 11, 2018 | 1.0.19 | 4.0 and up | |


In [9]:
# Drop the malformed row (its values are shifted one column to the left)
store_df.drop(index=10472, inplace=True)

In [10]:
# Convert the column to an integer dtype
store_df['Reviews'] = store_df['Reviews'].astype('int')

In [11]:
store_df.dtypes

App                object
Category           object
Rating            float64
Reviews             int32
Size               object
Installs           object
Type               object
Price              object
Content Rating     object
Genres             object
Last Updated       object
Current Ver        object
Android Ver        object
dtype: object
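
As an alternative way to find the offending rows, `pd.to_numeric` with `errors="coerce"` turns anything that cannot be parsed into NaN, which can then be used as a mask. This is a minimal sketch on made-up values; the `reviews` series is a stand-in for `store_df['Reviews']`, not the notebook's data.

```python
import pandas as pd

# Illustrative values only; "3.0M" plays the role of the malformed entry
reviews = pd.Series(["159", "967", "3.0M", "87510"])

# Unparseable strings become NaN instead of raising an error
as_number = pd.to_numeric(reviews, errors="coerce")

# Keep the original strings that failed to parse
bad_rows = reviews[as_number.isna()]
print(bad_rows)
```

Unlike `str.isnumeric()`, which flags any string containing a non-digit character, this approach accepts decimal strings and flags only values that are not numbers at all.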

<hr>

### 2.2 Size Column

In [12]:
store_df['Size'].unique()

array(['19M', '14M', '8.7M', '25M', '2.8M', '5.6M', '29M', '33M', '3.1M',
       '28M', '12M', '20M', '21M', '37M', '2.7M', '5.5M', '17M', '39M',
       '31M', '4.2M', '7.0M', '23M', '6.0M', '6.1M', '4.6M', '9.2M',
       '5.2M', '11M', '24M', 'Varies with device', '9.4M', '15M', '10M',
       '1.2M', '26M', '8.0M', '7.9M', '56M', '57M', '35M', '54M', '201k',
       '3.6M', '5.7M', '8.6M', '2.4M', '27M', '2.5M', '16M', '3.4M',
       '8.9M', '3.9M', '2.9M', '38M', '32M', '5.4M', '18M', '1.1M',
       '2.2M', '4.5M', '9.8M', '52M', '9.0M', '6.7M', '30M', '2.6M',
       '7.1M', '3.7M', '22M', '7.4M', '6.4M', '3.2M', '8.2M', '9.9M',
       '4.9M', '9.5M', '5.0M', '5.9M', '13M', '73M', '6.8M', '3.5M',
       '4.0M', '2.3M', '7.2M', '2.1M', '42M', '7.3M', '9.1M', '55M',
       '23k', '6.5M', '1.5M', '7.5M', '51M', '41M', '48M', '8.5M', '46M',
       '8.3M', '4.3M', '4.7M', '3.3M', '40M', '7.8M', '8.8M', '6.6M',
       '5.1M', '61M', '66M', '79k', '8.4M', '118k', '44M', '695k', '1.6M',
     

#### Discussion 🧐

1. Sizes are given in both "k" and "M", so we will convert them all to the same unit
2. Some of the megabyte values are decimals, which need extra care
3. Cells with the value 'Varies with device' will be converted to NaN; we will decide what to do with them later.
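
The three points above can also be handled in a single pass by parsing each string according to its suffix. The sketch below is one possible approach, not the notebook's own method: `size_to_mb` is a hypothetical helper, the sample values are illustrative, and it assumes the convention 1 MB = 1000 kB.

```python
import numpy as np
import pandas as pd

def size_to_mb(value):
    """Convert a size string such as '19M' or '201k' to megabytes (hypothetical helper)."""
    if not isinstance(value, str):
        return np.nan
    value = value.strip()
    if value.endswith("M"):
        return float(value[:-1])
    if value.endswith("k"):
        # Assumes 1 MB = 1000 kB
        return float(value[:-1]) / 1000
    # Anything else, e.g. 'Varies with device', becomes NaN
    return np.nan

# Illustrative values covering the three cases discussed above
sizes = pd.Series(["19M", "8.7M", "201k", "Varies with device"])
sizes_mb = sizes.map(size_to_mb)
print(sizes_mb)
```

Because the unit is read from the suffix, decimal megabyte values need no special treatment afterwards.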

In [13]:
# Normalize the values: 'M' becomes '000', 'k' is dropped, 'Varies with device' becomes NaN
store_df['Size'] = store_df['Size'].str.replace('M','000')
store_df['Size'] = store_df['Size'].str.replace('k','')
store_df['Size'] = store_df['Size'].replace('Varies with device',np.nan)

In [14]:
store_df['Size'] = store_df['Size'].astype('float')

In [15]:
store_df['Size']

0        19000.0
1        14000.0
2            8.7
3        25000.0
4            2.8
          ...   
10836    53000.0
10837        3.6
10838        9.5
10839        NaN
10840    19000.0
Name: Size, Length: 10840, dtype: float64

In [16]:
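# Fix the megabyte values that had decimal places: '8.7M' became 8.7 above
# instead of 8700, so scale every value below 10 by 1000.
# Note: this assumes no kilobyte entry is smaller than 10k.
small = store_df['Size'] < 10
store_df.loc[small, 'Size'] *= 1000

# Convert kilobytes to megabytes
store_df['Size'] = store_df['Size']/1000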

In [17]:
store_df.dtypes

App                object
Category           object
Rating            float64
Reviews             int32
Size              float64
Installs           object
Type               object
Price              object
Content Rating     object
Genres             object
Last Updated       object
Current Ver        object
Android Ver        object
dtype: object

<hr>

### 2.3 Installs and Price columns

In [22]:
store_df['Price'].unique()

array(['0', '$4.99', '$3.99', '$6.99', '$1.49', '$2.99', '$7.99', '$5.99',
       '$3.49', '$1.99', '$9.99', '$7.49', '$0.99', '$9.00', '$5.49',
       '$10.00', '$24.99', '$11.99', '$79.99', '$16.99', '$14.99',
       '$1.00', '$29.99', '$12.99', '$2.49', '$10.99', '$1.50', '$19.99',
       '$15.99', '$33.99', '$74.99', '$39.99', '$3.95', '$4.49', '$1.70',
       '$8.99', '$2.00', '$3.88', '$25.99', '$399.99', '$17.99',
       '$400.00', '$3.02', '$1.76', '$4.84', '$4.77', '$1.61', '$2.50',
       '$1.59', '$6.49', '$1.29', '$5.00', '$13.99', '$299.99', '$379.99',
       '$37.99', '$18.99', '$389.99', '$19.90', '$8.49', '$1.75',
       '$14.00', '$4.85', '$46.99', '$109.99', '$154.99', '$3.08',
       '$2.59', '$4.80', '$1.96', '$19.40', '$3.90', '$4.59', '$15.46',
       '$3.04', '$4.29', '$2.60', '$3.28', '$4.60', '$28.99', '$2.95',
       '$2.90', '$1.97', '$200.00', '$89.99', '$2.56', '$30.99', '$3.61',
       '$394.99', '$1.26', '$1.20', '$1.04'], dtype=object)

In [23]:
store_df['Installs'].unique()

array(['10,000+', '500,000+', '5,000,000+', '50,000,000+', '100,000+',
       '50,000+', '1,000,000+', '10,000,000+', '5,000+', '100,000,000+',
       '1,000,000,000+', '1,000+', '500,000,000+', '50+', '100+', '500+',
       '10+', '1+', '5+', '0+', '0'], dtype=object)

In [25]:
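# Strip the currency symbol, thousands separators, and plus signs
itens_mude = ['$',',','+']
columns = ['Price', 'Installs']

for item in itens_mude:
    for col in columns:
        # regex=False treats '$' and '+' as literal characters
        store_df[col] = store_df[col].str.replace(item,'',regex=False)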

store_df['Installs'] = store_df['Installs'].astype('int64')
store_df['Price'] = store_df['Price'].astype('float')

In [26]:
store_df.dtypes

App                object
Category           object
Rating            float64
Reviews             int32
Size              float64
Installs            int64
Type               object
Price             float64
Content Rating     object
Genres             object
Last Updated       object
Current Ver        object
Android Ver        object
dtype: object
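
The same cleanup can be written with a single character-class pattern instead of nested loops. The following is a sketch under assumptions: `toy_df` and its values are invented for illustration and only mimic the formats seen in `Price` and `Installs`.

```python
import pandas as pd

# Toy frame mimicking the raw Price and Installs formats (illustrative values)
toy_df = pd.DataFrame({
    "Price": ["0", "$4.99", "$399.99"],
    "Installs": ["10,000+", "1,000,000,000+", "0"],
})

for col in ["Price", "Installs"]:
    # Inside a character class, '$', ',' and '+' are matched literally
    toy_df[col] = toy_df[col].str.replace(r"[$,+]", "", regex=True)

toy_df["Price"] = toy_df["Price"].astype("float")
toy_df["Installs"] = toy_df["Installs"].astype("int64")
print(toy_df.dtypes)
```

Passing `regex` explicitly avoids depending on the default, which differs between pandas versions.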

<hr>

### 2.4 Last Updated column

In [30]:
store_df['Last Updated'] = pd.to_datetime(store_df['Last Updated'])

In [32]:
store_df.dtypes

App                       object
Category                  object
Rating                   float64
Reviews                    int32
Size                     float64
Installs                   int64
Type                      object
Price                    float64
Content Rating            object
Genres                    object
Last Updated      datetime64[ns]
Current Ver               object
Android Ver               object
dtype: object
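
When the date strings share one layout, an explicit `format` makes the parsing intent visible and raises an error on entries that do not match. This is a minimal sketch on sample strings in the "Month day, year" style shown earlier; it assumes English month names and a default (English) locale.

```python
import pandas as pd

# Sample strings in the same layout as the 'Last Updated' column
dates = pd.Series(["January 7, 2018", "August 1, 2018", "June 20, 2018"])

# %B = full month name, %d = day of month, %Y = four-digit year
parsed = pd.to_datetime(dates, format="%B %d, %Y")
print(parsed)
```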

<hr>

## 3. Data Cleaning

In [42]:
store_df.head()

| | App | Category | Rating | Reviews | Size | Installs | Type | Price | Content Rating | Genres | Last Updated | Current Ver | Android Ver |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Photo Editor & Candy Camera & Grid & ScrapBook | ART_AND_DESIGN | 4.1 | 159 | 19.0 | 10000 | Free | 0.0 | Everyone | Art & Design | 2018-01-07 | 1.0.0 | 4.0.3 and up |
| 1 | Coloring book moana | ART_AND_DESIGN | 3.9 | 967 | 14.0 | 500000 | Free | 0.0 | Everyone | Art & Design;Pretend Play | 2018-01-15 | 2.0.0 | 4.0.3 and up |
| 2 | U Launcher Lite – FREE Live Cool Themes, Hide ... | ART_AND_DESIGN | 4.7 | 87510 | 8.7 | 5000000 | Free | 0.0 | Everyone | Art & Design | 2018-08-01 | 1.2.4 | 4.0.3 and up |
| 3 | Sketch - Draw & Paint | ART_AND_DESIGN | 4.5 | 215644 | 25.0 | 50000000 | Free | 0.0 | Teen | Art & Design | 2018-06-08 | Varies with device | 4.2 and up |
| 4 | Pixel Draw - Number Art Coloring Book | ART_AND_DESIGN | 4.3 | 967 | 2.8 | 100000 | Free | 0.0 | Everyone | Art & Design;Creativity | 2018-06-20 | 1.1 | 4.4 and up |
