## Features (Columns) Information
1.  App : Name of the app
2.  Category : Category the app belongs to
3.  Rating : App rating on Google Play
4.  Reviews : Number of reviews of the app
5.  Installs : Number of installs of the app
6.  Type : Whether the app is free or paid
7.  Price : Price of the app (0 if it is free)
8.  Content Rating : Appropriate target audience of the app
9.  Genres : Genre the app belongs to
10. Size : Size of the app
11. Last Updated : Date when the app was last updated
12. Current Ver : Current version of the app
13. Android Ver : Minimum Android version required


In [71]:
import pandas as pd
import numpy as np
import seaborn as sns

In [72]:
dataset = pd.read_csv('googleplaystore.csv')

In [136]:
df = dataset.copy()
df.shape

(10841, 13)

### Clean Size Column

In [75]:
# One row has the value "1,000+" in the Size column; remove it from the dataset.
# regex=False matches the "+" literally and avoids the invalid "\+" escape sequence.
df = df[~df['Size'].str.contains('+', regex=False)]




In [76]:
# 1 item was removed
df.shape

(10840, 13)

In [77]:
# Convert sizes with an M (mega) or k (kilo) suffix to bytes; any other value becomes NaN
df["Size"] = [
    int(float(s[:-1]) * 1e6) if s.endswith("M") else
    int(float(s[:-1]) * 1e3) if s.endswith("k") else
    np.nan
    for s in df["Size"]
]
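The same conversion can also be written without a Python-level loop. The following is a minimal sketch, not the notebook's own method: the `sizes` series is a made-up sample that imitates the formats in the Size column, and the names `unit`, `number` and `size_bytes` are introduced here for illustration only.

```python
import pandas as pd

# Hypothetical sample imitating the Size column (not taken from the dataset)
sizes = pd.Series(["19M", "8.7M", "201k", "Varies with device"])

# Map the last character to a multiplier; anything other than M or k becomes NaN
unit = sizes.str[-1].map({"M": 1e6, "k": 1e3})

# Parse everything before the last character as a number; failures become NaN
number = pd.to_numeric(sizes.str[:-1], errors="coerce")

# NaN in either operand propagates, so unparseable sizes end up as NaN
size_bytes = number * unit
print(size_bytes)
```

Unlike the list comprehension, this version keeps the values as floats instead of truncating them with `int`, so the two approaches can differ in the last digit for some inputs.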

### Clean Installs Column

In [79]:
df["Installs"].unique()

array(['10,000+', '500,000+', '5,000,000+', '50,000,000+', '100,000+',
       '50,000+', '1,000,000+', '10,000,000+', '5,000+', '100,000,000+',
       '1,000,000,000+', '1,000+', '500,000,000+', '50+', '100+', '500+',
       '10+', '1+', '5+', '0+', '0'], dtype=object)

In [80]:
# for item in df["Installs"]:
#     for char in [",", "+"]:
#         df["Installs"].str.replace(char, "")

In [81]:
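# Strip the "+" and "," characters, then convert to a number.
# regex=False treats "+" as a literal character regardless of the pandas version.
df["Installs"] = df["Installs"].str.replace("+", "", regex=False)
df["Installs"] = df["Installs"].str.replace(",", "", regex=False)
df["Installs"] = df["Installs"].astype(float)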
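To check the cleaning logic in isolation, the three steps can be chained on a small series. This is only a sketch: the `installs` sample reuses a few of the values printed by `unique()` above, and the name `cleaned` is hypothetical.

```python
import pandas as pd

# A few of the raw values listed by df["Installs"].unique()
installs = pd.Series(["10,000+", "1,000,000,000+", "0+", "0"])

cleaned = (
    installs
    .str.replace("+", "", regex=False)  # drop the trailing plus sign
    .str.replace(",", "", regex=False)  # drop thousands separators
    .astype(float)                      # "10000" -> 10000.0
)
print(cleaned.tolist())
```

Note that "0+" and "0" both map to the same number, so the distinction between them is lost after cleaning.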

### Clean Last Updated

In [83]:
df["Last Updated"]

0         January 7, 2018
1        January 15, 2018
2          August 1, 2018
3            June 8, 2018
4           June 20, 2018
               ...       
10836       July 25, 2017
10837        July 6, 2018
10838    January 20, 2017
10839    January 19, 2015
10840       July 25, 2018
Name: Last Updated, Length: 10840, dtype: object

In [84]:
# Convert e.g. "January 7, 2018" into Day = 7, Month = 1, Year = 2018, then remove the Last Updated column
df["Last Updated"] = pd.to_datetime(df["Last Updated"])
df["Day"] = df["Last Updated"].dt.day.astype(int)
df["Month"] = df["Last Updated"].dt.month.astype(int)
df["Year"] = df["Last Updated"].dt.year.astype(int)

In [85]:
df.drop("Last Updated", axis=1, inplace=True)
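Since every value shown above follows the pattern "Month day, year", the format can be passed to `pd.to_datetime` explicitly instead of being inferred. The sketch below assumes an English locale for the month names and uses three dates copied from the output above; `dates`, `parsed` and `date_parts` are illustrative names.

```python
import pandas as pd

# Three values copied from the Last Updated output above
dates = pd.Series(["January 7, 2018", "June 20, 2018", "July 25, 2017"])

# %B = full month name, %d = day of month, %Y = four-digit year
parsed = pd.to_datetime(dates, format="%B %d, %Y")

# Split the timestamp into the same three columns the notebook creates
date_parts = pd.DataFrame({
    "Day": parsed.dt.day,
    "Month": parsed.dt.month,
    "Year": parsed.dt.year,
})
print(date_parts)
```

With an explicit format, a value that does not match raises an error instead of being parsed by guesswork, which makes malformed rows easier to spot.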

In [86]:
df.dtypes

App                object
Category           object
Rating            float64
Reviews            object
Size              float64
Installs          float64
Type               object
Price              object
Content Rating     object
Genres             object
Current Ver        object
Android Ver        object
Day                 int32
Month               int32
Year                int32
dtype: object

### Clean Reviews

In [88]:
df["Reviews"] = df["Reviews"].astype(int)

In [89]:
# Clean Price: strip the "$" sign (matched literally) and convert to float
df["Price"] = df["Price"].str.replace("$", "", regex=False)
df["Price"] = df["Price"].astype(float)

In [90]:
df["Price"].unique()

array([  0.  ,   4.99,   3.99,   6.99,   1.49,   2.99,   7.99,   5.99,
         3.49,   1.99,   9.99,   7.49,   0.99,   9.  ,   5.49,  10.  ,
        24.99,  11.99,  79.99,  16.99,  14.99,   1.  ,  29.99,  12.99,
         2.49,  10.99,   1.5 ,  19.99,  15.99,  33.99,  74.99,  39.99,
         3.95,   4.49,   1.7 ,   8.99,   2.  ,   3.88,  25.99, 399.99,
        17.99, 400.  ,   3.02,   1.76,   4.84,   4.77,   1.61,   2.5 ,
         1.59,   6.49,   1.29,   5.  ,  13.99, 299.99, 379.99,  37.99,
        18.99, 389.99,  19.9 ,   8.49,   1.75,  14.  ,   4.85,  46.99,
       109.99, 154.99,   3.08,   2.59,   4.8 ,   1.96,  19.4 ,   3.9 ,
         4.59,  15.46,   3.04,   4.29,   2.6 ,   3.28,   4.6 ,  28.99,
         2.95,   2.9 ,   1.97, 200.  ,  89.99,   2.56,  30.99,   3.61,
       394.99,   1.26,   1.2 ,   1.04])

In [91]:
# index=False keeps the row index from being written as an extra unnamed column
df.to_csv("clean_dataset.csv", index=False)

## EDA

In [None]:
df[df.duplicated("App")]

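### Observation
The dataset contains duplicate records: some apps appear more than once in the `App` column. The row count drops from 10,840 to 9,659 once they are removed, so 1,181 rows are repeats.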

In [94]:
df.drop_duplicates(subset=["App"], keep="first", inplace=True)

In [115]:
df.shape

(9659, 15)
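The effect of `keep="first"` is easiest to see on a toy frame. The following sketch uses made-up app names and review counts; `apps`, `n_dupes` and `deduped` are hypothetical names.

```python
import pandas as pd

# Made-up data: apps "A" and "B" each appear twice
apps = pd.DataFrame({
    "App": ["A", "B", "A", "C", "B"],
    "Reviews": [10, 5, 12, 7, 5],
})

# duplicated() flags every occurrence after the first one
n_dupes = int(apps.duplicated("App").sum())

# keep="first" retains the first row seen for each app name
deduped = apps.drop_duplicates(subset=["App"], keep="first")
print(n_dupes)
print(deduped)
```

Here app "A" keeps its first review count (10) and the later row with 12 reviews is discarded. If the later rows in the real data are more recent, sorting before deduplicating may be preferable; that is a design choice the notebook does not discuss.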

In [126]:
numerical_features = [feature for feature in df.columns if df[feature].dtype != "O"]
categorical_features = [feature for feature in df.columns if df[feature].dtype == "O"]

In [144]:
print("We have {} numerical features such as: {}".format(len(numerical_features), numerical_features))
print("We have {} categorical features such as: {}".format(len(categorical_features), categorical_features))

We have 8 numerical features such as: ['Rating', 'Reviews', 'Size', 'Installs', 'Price', 'Day', 'Month', 'Year']
We have 7 categorical features such as: ['App', 'Category', 'Type', 'Content Rating', 'Genres', 'Current Ver', 'Android Ver']
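As an alternative to comparing each dtype with `"O"`, pandas offers `select_dtypes` for the same split. This is a sketch on a small made-up frame, not the notebook's approach; `sample`, `numerical` and `categorical` are illustrative names.

```python
import pandas as pd

# Made-up frame with two text columns and two numeric columns
sample = pd.DataFrame({
    "App": ["A", "B"],
    "Rating": [4.1, 3.9],
    "Reviews": [159, 967],
    "Type": ["Free", "Paid"],
})

# "number" covers all integer and float dtypes
numerical = sample.select_dtypes(include="number").columns.tolist()
categorical = sample.select_dtypes(exclude="number").columns.tolist()
print(numerical)
print(categorical)
```

Selecting by `"number"` does not depend on text columns being stored with the object dtype, which may matter if the string representation differs between pandas versions.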

