# Final Project: Google Play Store
Authors:
Andres Puente and Francisco Solano López-Bleda

----
## Approach
The goal of this project is to train a neural network to predict an application's rating from the input features we provide.

Possible side quests:
- Generate automatic bot comments based on existing comments


## Datasets

### Description of the data

#### `googleplaystore.csv`
Contains information about the applications hosted on the Google Play Store as of August 8, 2018. This file is used mainly to look up information about each app, and it provides the features needed to predict an application's rating.

| App                       | Category  | Rating                  | Reviews           | Size                     | Installs                | Type                             | Price  | Content Rating          | Genres                  | Last Updated          | Current Ver     | Android Ver                         |
|---------------------------|-----------|-------------------------|-------------------|--------------------------|-------------------------|----------------------------------|--------|-------------------------|-------------------------|-----------------------|-----------------|-------------------------------------|
| App name | Category | App rating | Number of reviews | Size | Number of installs | Paid or free | Price | Content rating | App genre | Last update | Current version | Minimum supported Android version |

#### `googleplaystore_user_reviews.csv` 
Contains information about the reviews of the applications hosted on the Google Play Store as of August 8, 2018. This file could be used to detect fake reviews; by cross-referencing it with the previous dataset, we could also predict the percentage of fake reviews.

| App | Translated_Review | Sentiment | Sentiment_Polarity | Sentiment_Subjectivity |
|--------------------------|-------------------------|--------------------------------------|-------------------------|----------------------------|
| Application name | Translation of the review | Opinion: Positive, Negative or Neutral | Polarity of the opinion | Subjectivity of the opinion |
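
As a minimal sketch of what cross-referencing the two files could look like, the snippet below aggregates review polarity per app and joins it onto the app table through the shared `App` column. The two small frames and their values are invented for illustration; only the column names come from the tables above.

```python
import pandas as pd

# Toy stand-ins for the two CSV files (values are made up for illustration).
apps = pd.DataFrame({"App": ["App A", "App B"], "Rating": [4.1, 3.9]})
reviews = pd.DataFrame({
    "App": ["App A", "App A", "App B"],
    "Sentiment_Polarity": [0.5, 0.25, -0.5],
})

# One row per app: the mean polarity of its reviews.
mean_polarity = reviews.groupby("App", as_index=False)["Sentiment_Polarity"].mean()

# A left join keeps every app, including apps that have no reviews.
joined = apps.merge(mean_polarity, on="App", how="left")
print(joined)
```

A left join is used here so that apps without any review are kept (with a missing polarity) rather than silently dropped.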

## Imports

In [1]:
import pandas as pd
import numpy as np
import random
import matplotlib.pyplot as plt
from sklearn import metrics

## Data

### Reading the data - Data types

In [170]:
data = pd.read_csv('data/googleplaystore.csv')
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10841 entries, 0 to 10840
Data columns (total 13 columns):
App               10841 non-null object
Category          10841 non-null object
Rating            9367 non-null float64
Reviews           10841 non-null object
Size              10841 non-null object
Installs          10841 non-null object
Type              10840 non-null object
Price             10841 non-null object
Content Rating    10840 non-null object
Genres            10841 non-null object
Last Updated      10841 non-null object
Current Ver       10833 non-null object
Android Ver       10838 non-null object
dtypes: float64(1), object(12)
memory usage: 1.1+ MB


### Reading the data - Preview

Preview of the first 5 rows

In [171]:
data.head()

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,"100,000+",Free,0,Everyone,Art & Design;Creativity,"June 20, 2018",1.1,4.4 and up


## Data transformation

To build a sound machine learning model, we convert the data to int/float wherever possible, in order to obtain better correlation between the features.

Columns that are not considered necessary or relevant to the approach are dropped.

In [172]:
# Map each category to an integer code, in order of first appearance
categoryVal = data["Category"].unique()
category_dict = {category: i for i, category in enumerate(categoryVal)}
data["Category_i"] = data["Category"].map(category_dict).astype(int)
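
The dictionary-based encoding above can also be written with `pd.factorize`, which assigns codes in order of first appearance as well. The sketch below uses a small invented series in place of `data["Category"]`.

```python
import pandas as pd

# Toy stand-in for data["Category"]; the values are illustrative only.
categories = pd.Series(["ART_AND_DESIGN", "AUTO_AND_VEHICLES", "ART_AND_DESIGN", "BEAUTY"])

# factorize returns the integer codes and the unique values they refer to.
codes, uniques = pd.factorize(categories)
category_codes = pd.Series(codes, index=categories.index)

print(category_codes.tolist())  # [0, 1, 0, 2]
print(list(uniques))
```

Either way, the resulting integers carry an arbitrary order (ART_AND_DESIGN is not "less than" BEAUTY in any meaningful sense), which is one reason to one-hot encode the column as well.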

In [173]:
# Convert and clean the app size (M -> x1,000,000, k -> x1,000)
def sizes_trans(size):
    if 'M' in size:
        x = size[:-1]
        x = float(x)*1000000
        return(x)
    elif 'k' == size[-1:]:
        x = size[:-1]
        x = float(x)*1000
        return(x)
    else:
        return None
    
data["Size"] = data["Size"].map(sizes_trans)

# Fill the missing values with the previous valid value (forward fill)
data["Size"] = data["Size"].ffill()
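
Forward fill copies the size of whichever app happens to sit on the previous row, so the filled value depends on row order. As a hedged alternative (not what this notebook does), the sketch below parses the same `M`/`k` suffixes on an invented series and fills the gaps with the median instead. The helper name `size_to_number` is hypothetical.

```python
import pandas as pd

# Toy stand-in for data["Size"]; "Varies with device" has no numeric size.
sizes = pd.Series(["19M", "14M", "201k", "Varies with device"])

MULTIPLIERS = {"M": 1_000_000, "k": 1_000}

def size_to_number(size):
    """Return the numeric size, or NaN when the string has no M/k suffix."""
    suffix = size[-1]
    if suffix in MULTIPLIERS:
        return float(size[:-1]) * MULTIPLIERS[suffix]
    return float("nan")

size_num = sizes.map(size_to_number)

# Alternative to forward fill: replace missing sizes with the median size.
size_filled = size_num.fillna(size_num.median())
print(size_filled.tolist())
```

Whether the median, the per-category median, or forward fill is the better choice depends on the model, so this is a design option to evaluate rather than a fix.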

In [174]:
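# Convert the number of downloads (Installs) to int
def installs_trans(inst):
    try:
        # Drop the last character (the trailing "+") and the thousands separators
        x = inst[:-1].replace(",", "")
        x = int(x)
    except ValueError:
        # Values that are not of the form "1,000+" end up here and are set to 0
        print("Line is corrupt!", x)
        x = 0
    return x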


data["Installs"] = data["Installs"].map(installs_trans)

#data["Installs"] = [(i[:-1].replace(",","")) for i in data["Installs"]]
#data["Installs"] = data["Installs"].astype(float)

Line is corrupt! 
Line is corrupt! Fre
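
The two messages above come from values that do not end in `+`: because `inst[:-1]` always removes the last character, a value of `"0"` becomes an empty string and `"Free"` becomes `"Fre"`. A minimal vectorized sketch, on an invented series, strips only a trailing `+` and lets `pd.to_numeric` mark unparseable values as missing instead of turning them into 0.

```python
import pandas as pd

# Toy stand-in for data["Installs"], including the two problematic shapes.
installs = pd.Series(["10,000+", "5,000,000+", "Free", "0"])

# Remove thousands separators, then strip a trailing "+" only if present.
cleaned = installs.str.replace(",", "", regex=False).str.rstrip("+")

# errors="coerce" turns anything non-numeric into NaN instead of raising.
installs_num = pd.to_numeric(cleaned, errors="coerce")

# Rows that could not be parsed, for inspection before deciding what to do.
corrupt_rows = installs[installs_num.isna()]
print(installs_num.tolist())
print(corrupt_rows)
```

With this approach `"0"` is read as a valid zero, and only `"Free"` is flagged, which makes it easier to locate and drop the malformed row.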


In [175]:
# Convert the app type (Free / not Free) to binary
def type_trans(types):
    if types == 'Free':
        return 0
    else:
        return 1
    
data["Type"] = data["Type"].map(type_trans)
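
Note that `type_trans` sends every value other than `'Free'` to 1, so the one missing `Type` (10840 non-null in `data.info()`) and any malformed value are counted as paid. The sketch below shows an explicit mapping that leaves such values as missing. The `"Paid"` label is an assumption, since only `Free` appears in the preview above.

```python
import pandas as pd

# Toy stand-in for data["Type"]; "Paid" is an assumed label, "0" a malformed one.
types = pd.Series(["Free", "Paid", None, "0"])

# Values absent from the mapping become NaN instead of being counted as paid.
type_map = {"Free": 0, "Paid": 1}
type_num = types.map(type_map)

print(type_num.tolist())
```

The missing entries can then be dropped or imputed deliberately, rather than being folded into one of the two classes.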

In [176]:
# Convert "Content Rating" to one integer per unique value
RatingL = data["Content Rating"].unique()
RatingDict = {}
for i in range(len(RatingL)):
    RatingDict[RatingL[i]] = i
    
data["Content Rating"] = data["Content Rating"].map(RatingDict).astype(int)

In [177]:
# Convert "Genres" to int
GenresL = data.Genres.unique()
GenresDict = {}
for i in range(len(GenresL)):
    GenresDict[GenresL[i]] = i
    
data["Genres_i"] = data["Genres"].map(GenresDict).astype(int)

In [178]:
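# Convert prices to float
def price_trans(price):
    if price == '0':
        return 0
    else:
        # Drop the first character of the price string
        price = price[1:]
        # Handles a row whose Price value is "Everyone" (apparently a shifted row)
        if price == "veryone":
            price = 0
        price = float(price)
        return price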
    
data["Price"] = data["Price"].map(price_trans).astype(float)

In [179]:
# Convert the reviews to float

def review_trans(rev):
    if 'M' in rev:
        x = rev[:-1]
        x = float(x)*1000000
        return(x)
    elif 'k' == rev[-1:]:
        x = rev[:-1]
        x = float(x)*1000
        return(x)
    else:
        return rev

data["Reviews"] = data["Reviews"].map(review_trans).astype(float)

In [182]:
# Drop the columns we will not use because they are not relevant
data.drop(labels = ["Last Updated", "Current Ver", 'Android Ver', 'App'], axis = 1, inplace = True)

In [185]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10841 entries, 0 to 10840
Data columns (total 11 columns):
Category          10841 non-null object
Rating            9367 non-null float64
Reviews           10841 non-null float64
Size              10841 non-null float64
Installs          10841 non-null int64
Type              10841 non-null int64
Price             10841 non-null float64
Content Rating    10841 non-null int32
Genres            10841 non-null object
Category_i        10841 non-null int32
Genres_i          10841 non-null int32
dtypes: float64(4), int32(3), int64(2), object(2)
memory usage: 804.7+ KB


In [186]:
data.head()

Unnamed: 0,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Category_i,Genres_i
0,ART_AND_DESIGN,4.1,159.0,19000000.0,10000,0,0.0,0,Art & Design,0,0
1,ART_AND_DESIGN,3.9,967.0,14000000.0,500000,0,0.0,0,Art & Design;Pretend Play,0,1
2,ART_AND_DESIGN,4.7,87510.0,8700000.0,5000000,0,0.0,0,Art & Design,0,0
3,ART_AND_DESIGN,4.5,215644.0,25000000.0,50000000,0,0.0,1,Art & Design,0,0
4,ART_AND_DESIGN,4.3,967.0,2800000.0,100000,0,0.0,0,Art & Design;Creativity,0,2


## Dummies

In [188]:
data2 = pd.get_dummies(data, columns=["Category"])

In [189]:
data2.head()

Unnamed: 0,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Category_i,Genres_i,...,Category_PERSONALIZATION,Category_PHOTOGRAPHY,Category_PRODUCTIVITY,Category_SHOPPING,Category_SOCIAL,Category_SPORTS,Category_TOOLS,Category_TRAVEL_AND_LOCAL,Category_VIDEO_PLAYERS,Category_WEATHER
0,4.1,159.0,19000000.0,10000,0,0.0,0,Art & Design,0,0,...,0,0,0,0,0,0,0,0,0,0
1,3.9,967.0,14000000.0,500000,0,0.0,0,Art & Design;Pretend Play,0,1,...,0,0,0,0,0,0,0,0,0,0
2,4.7,87510.0,8700000.0,5000000,0,0.0,0,Art & Design,0,0,...,0,0,0,0,0,0,0,0,0,0
3,4.5,215644.0,25000000.0,50000000,0,0.0,1,Art & Design,0,0,...,0,0,0,0,0,0,0,0,0,0
4,4.3,967.0,2800000.0,100000,0,0.0,0,Art & Design;Creativity,0,2,...,0,0,0,0,0,0,0,0,0,0
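
As a small illustration of what `pd.get_dummies` produces, the sketch below one-hot encodes an invented three-row frame. It also drops `Category_i` first, on the assumption that the integer code becomes redundant once the one-hot columns exist; the notebook itself keeps both.

```python
import pandas as pd

# Toy frame with the same column names as the notebook (values are made up).
toy = pd.DataFrame({
    "Rating": [4.1, 3.9, 4.7],
    "Category": ["ART_AND_DESIGN", "TOOLS", "ART_AND_DESIGN"],
    "Category_i": [0, 1, 0],
})

# One column per category; dtype=int gives 0/1 instead of booleans.
encoded = pd.get_dummies(
    toy.drop(columns=["Category_i"]), columns=["Category"], dtype=int
)
print(encoded)
```

Each row has exactly one `Category_*` column set to 1, so the model no longer sees an artificial ordering between categories.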
