# Google Play Store Exploratory Data Analysis

Congratulations! You've been hired by APPDEV Inc. as their latest Data Science Intern. They've provided a Google Play Store dataset and want you to derive insights from it, so they know what type of apps to make next to attract lots of users and, in turn, more revenue.

Your tasks are simple and have been outlined for you...
all you have to do is follow them step by step, and you might wanna take some notes along the way.

Good luck!

In [16]:
# Run this cell and the next to install and import the required packages.
! pip install pandas matplotlib seaborn scipy



In [17]:
# import the required packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Time to explore your data

1. read in the CSV file
2. print the first 10 rows

In [1]:
# Code here...
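
If you're unsure where to start, here's a minimal sketch of the two steps. The notebook doesn't name the CSV file, so this uses a tiny made-up CSV held in memory (`toy_csv` and its values are hypothetical); with the real data you'd pass the file's path to `pd.read_csv` instead.

```python
import io

import pandas as pd

# Hypothetical stand-in for the real CSV file (its name isn't given here).
toy_csv = """App,Category,Rating,Installs
Alpha,TOOLS,4.1,"1,000+"
Beta,GAME,,"10,000+"
Gamma,GAME,3.9,"500+"
"""

# pd.read_csv accepts a file path or any file-like object.
apps = pd.read_csv(io.StringIO(toy_csv))

# head(10) returns at most the first 10 rows.
print(apps.head(10))
```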

3. check out more `info` about the data. 

In [2]:
# Code here...

4. seems like there are some missing values, find out how many there are...

In [3]:
# Code here...
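
One common way to count missing values, sketched on made-up data (the frame below is hypothetical, not the Play Store dataset):

```python
import numpy as np
import pandas as pd

# Toy data (made up) with gaps in two columns.
apps = pd.DataFrame({
    "App": ["Alpha", "Beta", "Gamma", "Delta"],
    "Rating": [4.1, np.nan, 3.9, np.nan],
    "Type": ["Free", "Paid", None, "Free"],
})

# isna() marks each missing cell as True; sum() counts them per column.
missing_per_column = apps.isna().sum()
print(missing_per_column)

# Summing again gives the total across the whole frame.
total_missing = int(missing_per_column.sum())
print(total_missing)
```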

5. are there `duplicated` values as well?

(*take some notes here*)

In [4]:
# Code here...
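
A small sketch of checking for duplicates, again on a made-up frame. It assumes "duplicated" means whole rows that repeat an earlier row exactly.

```python
import pandas as pd

# Toy data (made up): the third row repeats the first.
apps = pd.DataFrame({
    "App": ["Alpha", "Beta", "Alpha", "Gamma"],
    "Category": ["TOOLS", "GAME", "TOOLS", "GAME"],
})

# duplicated() flags a row as True when an identical row appeared earlier.
duplicate_mask = apps.duplicated()
n_duplicates = int(duplicate_mask.sum())
print(n_duplicates)
```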

# Now, let's clean the dirty data

The natural instinct is to drop the rows containing missing values, right?
but we would lose a lot of data if we did that, so let's try this instead:

1. define a threshold that won't allow loss of data above 5% of the whole dataset.
2. fetch the columns whose count of missing values is less than our set threshold into a list.
3. drop the rows of the original dataset that have missing values in the collected columns.
4. drop the duplicates as well, if there are any.
5. now, let's recheck the `info` of our data.

In [5]:
# Code here...
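
The steps above could be sketched like this. It's one reading of the instructions, not the only one: the threshold is taken to be 5% of the row count, and columns with no gaps at all are left out of the list. The data is made up.

```python
import numpy as np
import pandas as pd

# Toy data (made up): 40 rows, one gap in "Type", ten gaps in "Rating".
apps = pd.DataFrame({
    "App": [f"app_{i}" for i in range(40)],
    "Type": ["Free"] * 39 + [None],
    "Rating": [4.0] * 30 + [np.nan] * 10,
})

# 1. Assumed reading: 5% of the number of rows (2.0 rows here).
threshold = len(apps) * 0.05

# 2. Columns that have gaps, but fewer than the threshold.
missing = apps.isna().sum()
cols_to_drop = missing[(missing > 0) & (missing < threshold)].index.tolist()

# 3. Drop rows with gaps in those columns only; 4. then drop duplicate rows.
cleaned = apps.dropna(subset=cols_to_drop).drop_duplicates()

# 5. Recheck: "Rating" still has gaps because it was above the threshold.
cleaned.info()
```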

our data no longer looks as if we picked it from a dump site, right?
well, we are not there yet: we still have some missing data in the `Rating` column. let's see how we can fill that void and make our data as white as snow, so stay with me (^_^)...

6. `group` the data `by` the `Installs` column and compute the `median` of the `Rating` column, store the result as a dictionary.
7. on the original dataset, `fill` the missing values of the `Rating` column by `mapping` the dictionary on the `Installs` column.
8. if you check our data now, you'll realize there are still some missing values, as we couldn't fill them all. not to worry, you can `drop` these, since there aren't many.
9. recheck our data now, looks perfect, doesn't it?

(*take some notes here*)

In [6]:
# Code here...
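
Here's a hedged sketch of steps 6 to 8 on made-up data. It also shows why some gaps survive step 7: an `Installs` group with no ratings at all has a missing median, so there is nothing to fill with.

```python
import numpy as np
import pandas as pd

# Toy data (made up). The "10+" group has no rating at all.
apps = pd.DataFrame({
    "Installs": ["1,000+", "1,000+", "1,000+", "500+", "500+", "10+"],
    "Rating": [4.0, 5.0, np.nan, 3.0, np.nan, np.nan],
})

# 6. Median rating per Installs group, stored as a plain dict.
median_by_installs = apps.groupby("Installs")["Rating"].median().to_dict()

# 7. Look up each row's group median; fillna uses it only where Rating is missing.
apps["Rating"] = apps["Rating"].fillna(apps["Installs"].map(median_by_installs))

# 8. Rows whose group median was itself missing are still empty, so drop them.
apps = apps.dropna(subset=["Rating"])

# 9. Recheck.
print(apps)
```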

# Let's perform some transformation magic

If you take a look at the first few rows of our dataset, you'll notice that the values in the `Installs` column contain commas and plus signs. They should be numbers, but when you check the data type of this column and that of `Reviews`, you'll see that both are objects. We need to make them integers, as we will need them later:

1. replace the comma sign in the `Installs` column with nothing.
2. do the same for the plus sign.
3. now, convert the data type of the `Installs` column to integer.
4. do the same for the `Reviews` column.
5. recheck the original dataset to see if the changes have taken place.

(*you might wanna take some notes here*)

In [7]:
# Code here...
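
A minimal sketch of the clean-up on made-up values. Passing `regex=False` is a choice made here, not something the steps require: it makes pandas treat `,` and `+` as literal characters, which matters because `+` has a special meaning in regular expressions.

```python
import pandas as pd

# Toy data (made up): numbers stored as text, as described above.
apps = pd.DataFrame({
    "Installs": ["1,000+", "10,000,000+", "500+"],
    "Reviews": ["159", "87510", "12"],
})

# 1-3. Strip commas, strip plus signs, then convert to integers.
apps["Installs"] = (
    apps["Installs"]
    .str.replace(",", "", regex=False)
    .str.replace("+", "", regex=False)
    .astype(int)
)

# 4. Reviews has no extra characters in this toy data, so convert directly.
apps["Reviews"] = apps["Reviews"].astype(int)

# 5. Recheck the data types.
print(apps.dtypes)
```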

# Time for some Visual Scenes

APPDEV Inc. wants to know what category has the most apps and which has the least...

Let's explore some distributions on the App Categories:

1. create a histogram plot on the `Category` column.
2. label the x-axis: `Category of Apps on PlayStore`, y-axis: `Count of Apps in each Category`, title: `Distribution of App Categories`.
3. show the plot. 

In [8]:
# Code here...

oh my... that doesn't look like much. we can't derive good insights from it, and it looks kinda ugly, right? especially the x-axis, urgh!
well, let's add some cosmetics:

1. copy the code from the previous cell into the next.
2. this time, change the plot into a `countplot` from the seaborn package.
3. add a figure size of 10 by 6, to make it look bigger, do this at the very first line of the cell.
4. you can add `grid` lines, to add a touch of accuracy in reading.
5. the labels stay the same, but add an `xticks` with a `rotation` of 90 degrees.
6. let's add some coloration to the `countplot` to improve the beauty: `hue='Category'`.
7. now, let's behold the beauty.

(*take some notes here*)

In [9]:
# Code here...
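
The cosmetic steps above might look something like this. The frame is made up, and the exact look of the bars depends on your seaborn version.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Toy data (made up).
apps = pd.DataFrame(
    {"Category": ["GAME", "TOOLS", "GAME", "FAMILY", "GAME", "TOOLS"]}
)

# Figure size goes first, so the plot is drawn on the bigger canvas.
plt.figure(figsize=(10, 6))

# One bar per category; hue colors the bars by category.
ax = sns.countplot(data=apps, x="Category", hue="Category")

plt.grid(True)
plt.xlabel("Category of Apps on PlayStore")
plt.ylabel("Count of Apps in each Category")
plt.title("Distribution of App Categories")
plt.xticks(rotation=90)  # vertical tick labels avoid overlap
plt.show()
```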

...looked really good, right?

now, APPDEV Inc. wants to know if there is any correlation between ratings and reviews... 

let's perform a couple more visuals, this time on the relationship between `Reviews` and `Rating`:

1. set the figure size like you did earlier.
2. plot the `Rating` column against the `Reviews` column on a `scatterplot`.
3. label the x-axis: `Number of Reviews` and the title: `Relationship between Reviews and Ratings`.
4. now, `show` your plot.

(*take some notes here*)

In [10]:
# Code here...
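
A sketch of the scatter plot on made-up numbers, assuming `Reviews` goes on the x-axis (as the x-axis label suggests) and `Rating` on the y-axis:

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Toy data (made up).
apps = pd.DataFrame({
    "Reviews": [12, 159, 87510, 967, 4300],
    "Rating": [3.9, 4.1, 4.7, 4.3, 4.5],
})

plt.figure(figsize=(10, 6))

# One point per app: reviews on x, rating on y.
ax = sns.scatterplot(data=apps, x="Reviews", y="Rating")

plt.xlabel("Number of Reviews")
plt.title("Relationship between Reviews and Ratings")
plt.show()
```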

...that was okay.

moving on, APPDEV Inc. wants to know which category has more installs...

let's find out the distribution of `Installs` across the Categories:

1. use a `boxplot` to plot the `Installs` column against the `Category` column.
2. label the x-axis: `Category of Apps on PlayStore`, y-axis: `Amount of Installs in each Category`, title: `Distribution of Installs across Categories`.
3. show the plot. 

In [11]:
# Code here...

what is this? (\*o*)...

I bet that was the expression on your face.

yeah, it looked ridiculous, right? even worse than the last one. we definitely can't make sense of this, and the board might have a heart attack if we present this to them...

so many outliers, let's fix that:

1. create another column in the original dataset called `Installs_log`: compute the logarithm of the `Installs` column using the `log()` function from the numpy package and save the result to the new column.
2. subset both columns from the original dataset to view the first 5 rows side by side.

In [12]:
# Code here...
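
A minimal sketch of the log transform on made-up install counts:

```python
import numpy as np
import pandas as pd

# Toy data (made up), spanning several orders of magnitude.
apps = pd.DataFrame({"Installs": [10, 1_000, 100_000, 10_000_000]})

# np.log is the natural logarithm, applied element-wise.
apps["Installs_log"] = np.log(apps["Installs"])

# Double brackets select several columns at once.
print(apps[["Installs", "Installs_log"]].head())
```

One thing to watch for: `np.log(0)` gives `-inf`, so if any app in your data has 0 installs, you may want to check how those rows come out.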

now, let's construct the `boxplot` again, this time, with beauty:

1. copy the code from the previous cell into the next.
2. add a figure size of 10 by 6
3. use `Installs_log` instead of `Installs`.
4. add some coloration to the plot using the `Category` column.
5. the labels stay the same, but make the y-axis label: `Log-Amount of Installs in each Category` and add an `xticks` `rotation` of 90 degrees.
6. now, let's behold the beauty.

(*you might wanna take some notes here*)

In [13]:
# Code here...
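
Putting those steps together could look roughly like this. The data is made up, and `hue="Category"` is assumed to be what "coloration using the `Category` column" means.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Toy data (made up).
apps = pd.DataFrame({
    "Category": ["GAME", "GAME", "GAME", "TOOLS", "TOOLS", "TOOLS"],
    "Installs": [1_000, 100_000, 10_000_000, 500, 5_000, 50_000],
})
apps["Installs_log"] = np.log(apps["Installs"])

plt.figure(figsize=(10, 6))

# One box per category, drawn on the log scale and colored by category.
ax = sns.boxplot(data=apps, x="Category", y="Installs_log", hue="Category")

plt.xlabel("Category of Apps on PlayStore")
plt.ylabel("Log-Amount of Installs in each Category")
plt.title("Distribution of Installs across Categories")
plt.xticks(rotation=90)
plt.show()
```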

...wow, looks perfect, doesn't it? (^_^).

# We have a Hypothesis...

A thought just crossed the MD's mind: he's curious to know if there's any difference in the ratings of paid apps vs. free apps. He's come to your desk and asked you to satisfy his curiosity...

how do we go about this? let's find out:

1. import `ttest_ind` from the `stats` module of the `scipy` package, so we can find out what's up between these `two independent` samples.
2. subset the original dataset to keep only the free apps, using the `Type` column, and save the new DataFrame in `free_apps`.
3. do the same for paid apps, save to `paid_apps`.

In [14]:
# Code here...
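
A sketch of the import and the two subsets. The labels `"Free"` and `"Paid"` are an assumption about what the `Type` column contains; check the actual values in your data (for example with `value_counts()`) before relying on them.

```python
import pandas as pd
from scipy.stats import ttest_ind  # needed for the test in the next step

# Toy data (made up); "Free" / "Paid" are assumed labels.
apps = pd.DataFrame({
    "App": ["Alpha", "Beta", "Gamma", "Delta", "Epsilon"],
    "Type": ["Free", "Paid", "Free", "Paid", "Free"],
    "Rating": [4.1, 4.5, 3.9, 4.4, 4.0],
})

# A boolean mask keeps only the rows where the condition is True.
free_apps = apps[apps["Type"] == "Free"]
paid_apps = apps[apps["Type"] == "Paid"]

print(len(free_apps), len(paid_apps))
```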

you've been provided with an alpha value, and the test result variables have been set for you. now:

1. conduct the test, using the `Rating` column of the free apps against that of the paid apps.
2. create a result dictionary that takes the `p_value` as a key; the value should be `Reject null hypothesis: There is a significant difference in ratings.` if the p_value is less than the alpha value, or `Failed to reject null hypothesis: There is no significant difference in ratings.` if not.
3. print the result.

(*take some notes here*)

In [15]:
# Code here...
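
Here's a hedged sketch of the test on made-up ratings. The `alpha = 0.05` below is an assumption standing in for the value you were given, and `verdict` is just a helper name used here.

```python
import pandas as pd
from scipy.stats import ttest_ind

# Toy ratings (made up) standing in for free_apps["Rating"] and paid_apps["Rating"].
free_ratings = pd.Series([4.1, 3.9, 4.0, 4.2, 3.8])
paid_ratings = pd.Series([4.5, 4.4, 4.6, 4.3, 4.7])

alpha = 0.05  # assumed value; use the one provided to you

# ttest_ind compares the means of two independent samples.
t_stat, p_value = ttest_ind(free_ratings, paid_ratings)

if p_value < alpha:
    verdict = "Reject null hypothesis: There is a significant difference in ratings."
else:
    verdict = "Failed to reject null hypothesis: There is no significant difference in ratings."

# The p-value is the key, the verdict is the value.
result = {p_value: verdict}
print(result)
```

By default `ttest_ind` propagates missing values, so a single NaN in either sample makes the p-value NaN. That's one more reason the `Rating` gaps had to be dealt with earlier.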

# Well Done!

Wow, great job so far...
you've really done well.

but, it's not over... just one last thing tho:

* now that you've explored, cleaned and derived some insights from the data, it's time for you to present what you've found
* you have the whole of APPDEV Inc. in front of you. based on the notes you've taken, tell them a story to remember, one that they can make informed decisions from.

Good luck! (^_^)