# Google Play Store Analysis

### Objective 
The <b>Google Play Store</b> team is about to launch a new feature wherein certain promising apps
are boosted in visibility. <br> The boost will manifest in multiple ways: higher priority in
recommendation sections (“Similar apps”, “You might also like”, “New and updated games”) <br>
and higher visibility in search results.<br> This feature will help bring more
attention to newer apps that have potential.<br>
The task is to understand what makes an app perform well: <b> size? price? category? multiple
factors together? </b><br> <b>Analyze the data and present your insights</b> in a format consumable by
the business; the final output of the analysis will be presented to the business as insights with
supporting data/visualizations.  

Getting Started

In [2]:
import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

Getting our data into a pandas DataFrame

In [3]:
data = pd.read_csv('playstore-analysis.csv')

FileNotFoundError: [Errno 2] File playstore-analysis.csv does not exist: 'playstore-analysis.csv'
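
The error above means pandas could not find the CSV relative to the current working directory. A minimal sketch of a loader that reports the resolved path, assuming the file is expected next to the notebook; `load_playstore` is a hypothetical helper name, not part of the original notebook:

```python
from pathlib import Path

import pandas as pd


def load_playstore(path="playstore-analysis.csv"):
    """Hypothetical helper: read the CSV, or fail with the full path it looked for."""
    csv_path = Path(path)
    if not csv_path.is_file():
        # Show the absolute path so a wrong working directory is easy to spot
        raise FileNotFoundError(f"Expected the dataset at {csv_path.resolve()}")
    return pd.read_csv(csv_path)
```

Calling `data = load_playstore()` would then replace the bare `pd.read_csv` call once the file is in place.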

In [None]:
data.head()

In [None]:
data.describe()

In [None]:
data.describe(include='all')

In [None]:
data.info()

# Data Cleaning

Drop records where rating is missing since rating is our target/study variable

In [None]:
data.Rating

In [None]:
data.Rating.isnull().sum()

In [None]:
# data['Rating'].dropna(inplace=True,axis=0)
data.dropna(subset=["Rating"], axis=0, inplace=True)

In [None]:
data.Rating.isnull().sum()

Check the null values for the Android Ver column.<br>

i. Are all 3 records having the same problem?<br>
ii. Drop the 3rd record, i.e. the record for “Life Made WIFI ...”<br>
iii. Replace remaining missing values with the mode<br>
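
The three steps can be sketched on a small made-up frame before applying them to the real data. The toy rows and the stand-in app name `"D"` are assumptions for illustration only:

```python
import numpy as np
import pandas as pd

# Toy data: two rows are missing 'Android Ver'; row 'D' plays the malformed record
toy = pd.DataFrame({
    "App": ["A", "B", "C", "D", "E"],
    "Android Ver": ["4.1 and up", np.nan, "4.1 and up", np.nan, "5.0 and up"],
})

# i. inspect the rows with a missing value
missing = toy[toy["Android Ver"].isna()]

# ii. drop the malformed record by name
toy = toy[toy["App"] != "D"].copy()

# iii. fill what is left with the most frequent value
mode_value = toy["Android Ver"].mode().iloc[0]
toy["Android Ver"] = toy["Android Ver"].fillna(mode_value)
```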


In [None]:
data['Android Ver'].isna().sum()

In [None]:
data[data['Android Ver'].isna()]
# Not all the records have the same problem; the 3rd row is badly malformed

Dropping the third record

In [None]:
i = data[data.App=='Life Made WI-Fi Touchscreen Photo Frame'].index

In [None]:
data.drop(i,inplace=True)

In [None]:
data[data['Android Ver'].isna()]

Replacing with mode values


In [None]:
Modeval = data['Android Ver'].value_counts().idxmax()

In [None]:
data['Android Ver'] = data['Android Ver'].fillna(Modeval)  # assign back; in-place replace on a selected column may not update data

In [None]:
data['Android Ver'].isna().sum()

In [None]:
Currval = data['Current Ver'].value_counts().idxmax()

In [None]:
data['Current Ver'] = data['Current Ver'].fillna(Currval)  # assign back; in-place replace on a selected column may not update data

In [None]:
data['Current Ver'].isna().sum()

Final check to see whether any null values remain

In [None]:
data.isnull().sum()

# Data Standardisation - correcting the data types

Which variables need to be converted to numeric types?

In [None]:
data.head() # Reviews, Installs, Price, Current Ver and Android Ver need to be numeric

In [None]:
data.info() # Rating and Size are already float64

#### Let's work with the Price variable first

In [None]:
data[data.Type!='Free'] # prices look like $4.99, so the $ sign has to be removed

In [None]:
data.Price = data.Price.str.replace('$', '', regex=False).astype('float64') # strip the literal $ ('$' is a regex anchor otherwise) and convert to float

In [None]:
data[data.Type!='Free'].head() # Check that the conversion worked

#### Now we will remove the + and , characters from the Installs column and convert it to int

In [None]:
data.Installs = data.Installs.str.replace('+', '', regex=False)  # '+' is a regex quantifier, so match it literally
data.Installs = data.Installs.str.replace(',', '', regex=False).astype(int)
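
A small sketch of the same clean-up on made-up values, to confirm the strings convert as expected. Passing `regex=False` makes pandas treat `$` and `+` as literal characters rather than regular-expression syntax; the toy values are assumptions:

```python
import pandas as pd

# Toy values in the same string format as the Price and Installs columns
toy = pd.DataFrame({
    "Price": ["0", "$4.99", "$0.99"],
    "Installs": ["1,000+", "10,000,000+", "50+"],
})

# Strip the currency sign, then convert to float
toy["Price"] = toy["Price"].str.replace("$", "", regex=False).astype("float64")

# Strip the trailing plus and the thousands separators, then convert to int
toy["Installs"] = (
    toy["Installs"]
    .str.replace("+", "", regex=False)
    .str.replace(",", "", regex=False)
    .astype("int64")
)
```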

In [None]:
data.info()

In [None]:
data.Reviews = data.Reviews.astype(int) # Changing reviews to int

In [None]:
data.Size # let's look at Size

In [None]:
data.Size = data['Size'].apply(lambda x: round(x/1024))
# convert app size to MB (dividing by 1024 assumes Size is stored in KB)

In [None]:
data.Size

### Sanity checks 

#### Avg. rating should be between 1 and 5, as only these values are allowed on the play store.

In [None]:
data[data.Rating>5] # There are no rows with a rating above 5
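
The check above only covers the upper bound. A minimal sketch that tests both ends of the allowed range in one step, shown on made-up ratings:

```python
import pandas as pd

# Toy ratings: 'B' is below the allowed range and 'D' is above it
toy = pd.DataFrame({"App": ["A", "B", "C", "D"], "Rating": [4.5, 0.5, 5.0, 19.0]})

# Series.between is inclusive on both ends by default, so 1 and 5 count as valid
out_of_range = toy[~toy["Rating"].between(1, 5)]
```

An empty `out_of_range` frame on the real data would confirm that every rating lies between 1 and 5.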

In [None]:
data[data.Reviews > data.Installs]

In [None]:
inds = data[data.Reviews > data.Installs].index

In [None]:
inds

In [None]:
data.drop(inds,inplace=True)

In [None]:
data[data.Reviews > data.Installs]

### Identify and handle outliers 

In [None]:
sns.boxplot(x=data['Price'])

In [None]:
data[data.Price>250]

In [None]:
data.drop(data[data.Price>250].index,inplace=True)

In [None]:
sns.boxplot(x=data['Price'])

In [None]:
data[data.Price>=70]

In [None]:
data.drop(data[data.Price>=70].index, inplace=True)

In [None]:
sns.boxplot(x=data['Price'], orient='v')

Do you expect apps on the Play Store to cost $200? Check out these cases.<br>
Answer: An app priced at $200 is unlikely, but not impossible.
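
Before dropping rows, it can help to count how many each price cutoff would remove. A rough sketch on made-up prices; the cutoffs mirror the ones used in this section, and `>=` is applied uniformly here for simplicity:

```python
import pandas as pd

# Toy prices, including a couple of implausibly expensive apps
toy_prices = pd.Series([0.0, 0.99, 4.99, 29.99, 79.99, 399.99])

# Number of rows at or above each candidate cutoff
cutoffs = [250, 70, 30]
removed = {cutoff: int((toy_prices >= cutoff).sum()) for cutoff in cutoffs}
```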

#### Limiting data to records with price < $30

In [None]:
data = data[data.Price<30]

In [None]:
data

#### Reviews column

In [None]:
sns.boxplot(x=data['Reviews'], orient='v')

In [None]:
data.Reviews.value_counts()

In [None]:
data = data[data.Reviews<1000000] #Limit data to apps with < 1 Million reviews

#### Installs

In [None]:
per = np.percentile(data.Installs,95)

In [None]:
data = data.drop(data[data.Installs>per].index) # drop rows with installs above the 95th percentile
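
The percentile cutoff can be sketched on a few made-up install counts to see what it keeps:

```python
import numpy as np
import pandas as pd

# Toy install counts with one extreme value
toy = pd.DataFrame({"Installs": [10, 100, 1000, 10000, 100000, 1000000000]})

# 95th percentile (NumPy interpolates linearly between data points by default)
cutoff = np.percentile(toy["Installs"], 95)

# Keep rows at or below the cutoff, which matches dropping rows above it
trimmed = toy[toy["Installs"] <= cutoff]
```

If many rows share the cutoff value exactly, fewer than 5% of the rows are removed, because only values strictly above the cutoff are dropped.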

## Data analysis to answer business questions

In [None]:
sns.histplot(data.Rating, kde=True)  # distplot is deprecated in recent seaborn versions

Ratings are skewed toward higher values: the plot shows most apps rated between 4 and 5, while very few apps are rated between 1 and 2.
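
The skew can also be checked numerically rather than only by eye. A minimal sketch on made-up ratings; a negative skewness value indicates a longer tail on the low side:

```python
import pandas as pd

# Toy ratings concentrated between 4 and 5, with a short tail of low values
toy_ratings = pd.Series([2.0, 3.5, 4.0, 4.2, 4.3, 4.4, 4.5, 4.5, 4.6, 4.7])

# Sample skewness: below zero when the tail extends toward low ratings
skewness = toy_ratings.skew()

# Share of apps rated 4 or higher
share_4_plus = (toy_ratings >= 4).mean()
```

Running the same two lines on `data.Rating` would give figures that can be quoted alongside the plot.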

In [None]:
data.Rating.plot(kind='hist')
plt.xlabel('Rating');

The most common rating is around 4.5, followed by ratings of 4 and 5.<br>
Very few apps have a rating between 1 and 2, so let's drop them.

In [None]:
data.drop(data[data.Rating<2].index,inplace=True)

In [None]:
data.Rating.plot(kind='hist')
plt.xlabel('Rating');

In [None]:
sns.pairplot(data)