# Data Visualization Challenge - Google Play Store 📱


The goal of this challenge is to analyze a [dataset about apps and games from the Google Play Store](https://www.kaggle.com/lava18/google-play-store-apps#googleplaystore.csv). Some cells are already implemented, so you just need to **run** them. Other cells need you to write some code.

### Here is a quick guide on how to use this Jupyter Notebook 🤔

* Type inside the empty cells to write code. These empty cells have an `In [ ]:` prefix in front of them
* Press the `return/enter ⏎` key to add a new line inside the cell
* To display your results, use Python's built-in `print(STUFF_YOU_WANT_TO_PRINT)` function or simply put what you want to display on the last line of the cell. The result of the last line will appear as the `Out[]:`, the output of the cell :)
* Press `shift` + `return/enter ⏎` to run your code 🤓 This runs the code inside your currently selected cell and displays anything passed to `print()`, plus the result of the last line of the cell
* To add a new cell, select any cell and press the `b` key (make sure you are not just typing the letter `b` in the cell). This will add a new cell below
* To delete a cell, press the `d` key twice (make sure you are not just typing the letter `d` in the cell)

### Start the challenge by running the two following cells:

In [None]:
# we will need these libraries to run our analytics and visualisation
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
# we will read the CSV file into a DataFrame - the format that we can easily analyze and manipulate in Python
apps_df = pd.read_csv("data/googleplaystore.csv")

### 👀 Feel free to have a quick glance at the data - `.shape`, `.columns`, `.head()`, `.tail()`, `.dtypes`
<br>

<details>
    <summary>Not sure how? Reveal some tips 🙈</summary>

<p> 
<pre>
apps_df.shape
apps_df.columns
apps_df.head() # you can add a number in parentheses for how many first rows you want to display
apps_df.dtypes
</pre>
</details>

In [None]:
# your code here

----------

### 🧹 As you might have noticed, we have some cleaning to do on our DataFrame

First of all, let's see how many empty values there are in our DataFrame using the [DataFrame.isnull().sum() methods](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.isnull.html)

<br>

<details>
    <summary>Not sure how? Click to see solution 🙈</summary>

<p> 
<pre>
apps_df.isnull().sum()
</pre>
</details>

In [None]:
# your code here

You'll notice there is one column that is particularly full of `null` values. Let's go ahead and overwrite those with a rating of `0.0`.


<br>

<details>
    <summary>Not sure how? Click to see solution 🙈</summary>


<p>The "Rating" column has over 10% null values - this will mess up our analysis and visualisation later. Let's use the <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html">Pandas.DataFrame.fillna()</a> method to fix that.</p>
<pre>
apps_df.fillna({'Rating': 0.0}, inplace=True)
</pre>
</details>

In [None]:
# your code here
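
One thing to keep in mind when filling missing ratings with `0.0`: the zeros count as real ratings in every later average. A minimal sketch on a made-up three-row DataFrame (the app names and values are illustrative, not taken from the dataset) shows the effect:

```python
import numpy as np
import pandas as pd

# Toy data: one app has no rating (hypothetical values, not from the dataset)
toy_df = pd.DataFrame({
    "App": ["A", "B", "C"],
    "Rating": [4.0, np.nan, 5.0],
})

# By default, mean() skips missing values
mean_ignoring_nulls = toy_df["Rating"].mean()

# After filling, the 0.0 is treated as a real rating and pulls the mean down
filled_df = toy_df.fillna({"Rating": 0.0})
mean_after_fill = filled_df["Rating"].mean()

print(mean_ignoring_nulls, mean_after_fill)  # 4.5 3.0
```

Whether that trade-off is acceptable depends on the question you are asking; for charts of rating distributions, the zeros will show up as their own cluster.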

Let's also get rid of some columns that will not help our analysis. The more we clean and organise our data before we analyse, the more insight our visualisations will give us! 🧹


<br>

<details>
    <summary>Not sure how? Click to see solution 🙈</summary>


<p>Let's use the <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html?highlight=drop#pandas.DataFrame.drop">DataFrame.drop()</a> method to drop some columns like we did with AirBnB data.</p>
<pre>
columns_to_drop = ['Android Ver', 'Current Ver', 'Last Updated', 'Genres'] # we will use 'Category' instead of Genres
apps_df.drop(columns_to_drop, axis="columns", inplace=True)
</pre>
</details>

In [None]:
# your code here

If we dig deeper, we will see that some app names are duplicated. It could have been the error of someone scraping the data, or developers trying to impersonate another app to get more downloads. In any case let's use the very handy [DataFrame.drop_duplicates()](https://www.geeksforgeeks.org/python-pandas-dataframe-drop_duplicates/) method to clean that up.


<br>

<details>
    <summary>Not sure how? Click to see solution 🙈</summary>


<pre>
apps_df.drop_duplicates(subset="App", keep = 'first', inplace = True) 
</pre>
</details>

In [None]:
# your code here
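
To see what `keep='first'` does, here is a minimal sketch on made-up rows (the app names and review counts are hypothetical). Note that which copy survives depends on the row order at the time of the call:

```python
import pandas as pd

# Toy data with a repeated app name (hypothetical values)
toy_df = pd.DataFrame({
    "App": ["Chat", "Chat", "Maps"],
    "Reviews": [100, 250, 80],
})

# keep="first" keeps the first row for each app name and drops later repeats
deduped_df = toy_df.drop_duplicates(subset="App", keep="first")

print(deduped_df["Reviews"].tolist())  # [100, 80]
```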

Finally there is the `Price` column, which right now has the data type `object` (or `string` as we know them). Since this is a numeric value which would be interesting to measure, let's fix that. **This one is a little tricky**, as we first need to get rid of the **$** sign.


<br>

<details>
    <summary>Not sure how? Click to see solution 🙈</summary>


<p>We will use the <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.astype.html">Series.astype()</a> method to convert strings to floats, but first we need to clean up the prices a bit.</p>
<pre>
apps_df['Price'] = apps_df['Price'].str[1:] # getting rid of the dollar sign
apps_df['Price'] = apps_df['Price'].replace('', '0.0') # we are replacing any empty values with 0.0
apps_df['Price'] = apps_df['Price'].astype(float) # now we are converting each price from a string to a float
</pre>
</details>

In [None]:
# your code here
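
If the conversion to `float` fails because a price is not a clean number, one possible alternative is `pd.to_numeric` with `errors="coerce"`. This is only a sketch on made-up values (the malformed `"Everyone"` entry is an assumption for illustration), not the approach used in the solution above:

```python
import pandas as pd

# Toy prices in the same "$x.xx" style, plus one malformed entry (hypothetical)
prices = pd.Series(["$4.99", "0", "$1.50", "Everyone"])

# Remove the dollar sign wherever it appears, then convert to numbers;
# errors="coerce" turns anything that is not a number into NaN
cleaned = pd.to_numeric(prices.str.replace("$", "", regex=False), errors="coerce")

print(cleaned.tolist())  # [4.99, 0.0, 1.5, nan]
```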

In the future, you might also want to clean up the `Size` column. That will require some more advanced logic, because some sizes are in kilobytes and some in megabytes... A fun exercise for later! 😉
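
As a starting point for that exercise, here is a rough sketch. It assumes sizes are written like `19M` or `201k`, with anything else (for example `Varies with device`) treated as unknown; the helper name `size_to_megabytes` and the sample values are made up for illustration:

```python
import numpy as np
import pandas as pd

def size_to_megabytes(size):
    """Convert a size string such as '19M' or '512k' to megabytes (NaN if unknown)."""
    size = str(size).strip()
    if size.endswith("M"):
        return float(size[:-1])
    if size.endswith("k"):
        return float(size[:-1]) / 1024  # assuming 1 MB = 1024 kB
    return np.nan  # anything else, e.g. 'Varies with device'

# Toy values in the assumed formats (not taken from the dataset)
sizes = pd.Series(["19M", "512k", "Varies with device"])
sizes_mb = sizes.apply(size_to_megabytes)

print(sizes_mb.tolist())  # [19.0, 0.5, nan]
```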

---------

### ✨ Our DataFrame is now clean and ready for us to explore

#### Let's start with a few quick [Exploratory Data Analysis](https://en.wikipedia.org/wiki/Exploratory_data_analysis) questions 🤔

❓ What are the top 5 most expensive apps on the Play Store?


<br>

<details>
    <summary>Not sure how? Click to see solution 🙈</summary>

<p> 
<pre>
apps_df.nlargest(5, 'Price')
</pre>

</details>

In [None]:
# your code here

❓ What are the 10 most reviewed apps?


<br>

<details>
    <summary>Not sure how? Click to see solution 🙈</summary>

<p> 
<pre>
apps_df.nlargest(10, 'Reviews')
</pre>
</details>

In [None]:
# your code here

❓ How many different categories of apps are there? Try not to count manually! :)


<br>

<details>
    <summary>Not sure how? Click to see solution 🙈</summary>

<p> 
<pre>
apps_df['Category'].unique() # this gives us a list of all the categories
len(apps_df['Category'].unique()) # this gives us the length of the list
</pre>
</details>

In [None]:
# your code here

❓ How many apps are there for each Content Rating?


<br>

<details>
    <summary>Not sure how? Click to see solution 🙈</summary>

<p> 
<pre>
apps_df['Content Rating'].value_counts()
</pre>
</details>

In [None]:
# your code here

❓ What are the 10 categories with the most apps? Display **only** the top 10!


<br>

<details>
    <summary>Not sure how? Click to see solution 🙈</summary>

<p> 
<pre>
apps_df['Category'].value_counts().head(10)
</pre>
</details>

In [None]:
# your code here

-------

### 🎨 Let's start visualizing - `CountPlot` for making quick bar charts

For our first visualisations we will be using the [Seaborn Countplot](https://seaborn.pydata.org/generated/seaborn.countplot.html). 

Countplot is great for counting the occurrences of the unique values in a column. For example, we used it to find out how many listings of each apartment type there are in the AirBnB data, and even how many different types there are, without running any extra code.

#### Let's begin! 📊

1️⃣ Let's start with a simple one - a bar chart of free versus paid apps

P.S. remember the different methods you can use to explore your DataFrame - `.columns`, `.head()` and others - in case you are not sure which data points to use ;)


<br>

<details>
    <summary>Not sure how? Click to see solution 🙈</summary>

<p> 
<pre>
sns.countplot(data=apps_df, x='Type')
</pre>
</details>

In [None]:
# by the way, you can add this line *before each* plot; it lets you adjust the (width, height) of the chart:
plt.figure(figsize=(10, 5))

# your code here

Note our data is heavily skewed to one `Type`. We will need to take that into account in future challenges!

--------

2️⃣ Let's plot how many apps there are for each content rating.

<br>

<details>
    <summary>Not sure how? Click to see solution 🙈</summary>

<p> 
<pre>
sns.countplot(data=apps_df, x="Content Rating")
</pre>
</details>

In [None]:
# remember, you can add this line *before each* plot; it lets you adjust the (width, height) of the chart:
plt.figure(figsize=(10, 5))

# your code here

2️⃣+ | Notice that we can improve the **order** in which the columns are displayed. How can we do that?

**Tip:** think about the order displayed when you ran `apps_df['Content Rating'].value_counts()`

<br>

<details>
    <summary>Not sure how? Click to see solution 🙈</summary>

<p> 
<pre>
order = apps_df['Content Rating'].value_counts().index # we are getting the index (or the "order") of our result
sns.countplot(data=apps_df, x="Content Rating", order=order) # and adding that as the order attribute to our plot
</pre>
</details>



In [None]:
# remember, you can add this line *before each* plot; it lets you adjust the (width, height) of the chart:
plt.figure(figsize=(10, 6))

# your code here

-------------

3️⃣ **Adding a hue**: let's add a `hue` attribute to our Countplot. For example like this `sns.countplot(data=apps_df, x='Content Rating', hue='Type')`. What does that do? Try running the cell below!

In [None]:
# remember that you can change the size of each plot; this will be helpful as our Data Visualization goes deeper
plt.figure(figsize=(15, 6))
sns.countplot(data=apps_df, x='Content Rating', hue='Type')
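
The numbers behind a `hue` chart can also be read as a table. A minimal sketch using `pd.crosstab` on made-up rows (the values are illustrative only):

```python
import pandas as pd

# Toy data with the same two columns used in the plot (hypothetical values)
toy_df = pd.DataFrame({
    "Content Rating": ["Everyone", "Everyone", "Teen", "Teen", "Everyone"],
    "Type": ["Free", "Paid", "Free", "Free", "Free"],
})

# Rows are content ratings, columns are types, cells are app counts
counts = pd.crosstab(toy_df["Content Rating"], toy_df["Type"])
print(counts)
```

Each cell of the table corresponds to one bar in the countplot with `hue`.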

In this dataset we don't have other small categorical data, which would be a good `hue`. But you can already imagine how this can be very useful when you are doing visualisation, as you factor in categories in your charts 🤔

---------------

🏋️ **Optional: Adding an order**: let's add an `order` attribute to our Countplot so that we can control the order of bars in our chart, as well as how many are displayed. The syntax looks like this `sns.countplot(data=apps_df, x=???, order=❓)`. First let's figure out how we get the order!

Do you remember how you got the categories with the most apps?

`apps_df['Category'].value_counts().head(5)`, for example, will give us the five categories with the most apps.

If we want to capture the **order** of this list, we need to add `.index` at the end. The index of a `value_counts()` result holds the category names, sorted from most to least frequent.

`apps_df['Category'].value_counts().head(5).index`

Feel free to give it a try!

In [None]:
order = apps_df['Category'].value_counts().head(5).index
sns.countplot(data=apps_df, x='Category', order=order)
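
To check what `.index` actually contains, here is a small sketch on a made-up list of categories (not the real dataset):

```python
import pandas as pd

# Toy category column (hypothetical values)
categories = pd.Series(["GAME", "FAMILY", "GAME", "TOOLS", "GAME", "FAMILY"])

counts = categories.value_counts()      # sorted from most to least frequent
order = counts.head(2).index.tolist()   # the labels of the top 2 categories

print(order)  # ['GAME', 'FAMILY']
```

Because `order` only lists two labels, a countplot using it would draw only those two bars.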

Now your goal is to make a Countplot of the 10 most common `Installs` values!


<br>

<details>
    <summary>Not sure how? Click to see solution 🙈</summary>

<p>Tip: don't try to put everything on the same line if it's getting too long. Your code should be easy to read for you and others.</p>
<pre>
top_10_installs = apps_df['Installs'].value_counts().head(10)
order = top_10_installs.index
sns.countplot(data=apps_df, x='Installs', order=order)
</pre>
</details>

In [None]:
# remember that you can change the size of each plot; this will be helpful as our Data Visualization goes deeper
plt.figure(figsize=(15, 6))

# your code here

---------

### 🎨 Next step - `Distplot` - a quick way to make [histograms](https://www.mathsisfun.com/data/histograms.html)

The [Seaborn Distplot](https://seaborn.pydata.org/generated/seaborn.distplot.html) is useful for seeing how non-repetitive values are distributed. This chart is particularly useful for understanding the distribution of numeric values (for example, the price in our AirBnB data) or other data points which do not have standard values. Try running the cell below to make our first Distplot with the prices of the apps!

In [None]:
sns.distplot(apps_df['Price'])

You'll notice that, same as with our AirBnB data, we have outliers that heavily skew the data. A good measure would be to check the [mean](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.mean.html) of the columns to better understand our data.

Let's try with `Price`!

In [None]:
apps_df['Price'].mean()
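
Keep in mind that the mean itself is sensitive to outliers, so it can help to look at the median next to it. A small sketch on made-up prices with one extreme value:

```python
import pandas as pd

# Toy prices with one extreme outlier (hypothetical values)
prices = pd.Series([0.0, 0.0, 0.99, 1.99, 399.99])

print(prices.mean())    # pulled up by the single expensive app
print(prices.median())  # 0.99, the middle value, barely affected
```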

We can see that apps overall are quite cheap, so let's create another `DataFrame` with apps that cost below 10 USD. 

For that we will need to use **Boolean Filtering!**

In [None]:
low_price = apps_df['Price'] < 10
cheaper_apps_df = apps_df[low_price]
cheaper_apps_df = cheaper_apps_df[cheaper_apps_df['Price'] > 0] # let's also get rid of the free apps, which would skew the results
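
The two filtering steps above can also be combined into one. A minimal sketch on made-up rows (the names `is_paid`, `is_cheap` and `cheap_paid_df` are just illustrative):

```python
import pandas as pd

# Toy data (hypothetical values, not from the dataset)
toy_df = pd.DataFrame({
    "App": ["A", "B", "C", "D"],
    "Price": [0.0, 2.99, 9.99, 24.99],
})

# Each comparison yields a Series of True/False values, one per row
is_paid = toy_df["Price"] > 0
is_cheap = toy_df["Price"] < 10

# & keeps rows where both conditions are True (note: & rather than `and`)
cheap_paid_df = toy_df[is_paid & is_cheap]

print(cheap_paid_df["App"].tolist())  # ['B', 'C']
```

Giving each condition its own name keeps the final filter readable, especially once you start combining more than two.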


We are now ready for a more realistic `distplot`! Go ahead and make one using the `cheaper_apps_df` DataFrame


<br>

<details>
    <summary>Not sure how? Click to see solution 🙈</summary>


<pre>
sns.distplot(cheaper_apps_df['Price'])
</pre>
</details>

In [None]:
# your code here

Try out your own Distplots by changing to other numeric columns! `Rating` can be a good option!

In [None]:
# your code here

-----

### 🎨 On to the `ViolinPlot` - a chart for analysing grouped distribution

We want to use the [Seaborn Violinplot](https://seaborn.pydata.org/generated/seaborn.violinplot.html) when we want to see distribution not only across our general DataFrame, but grouped by a category column (for example, prices by apartment type on AirBnB). This is useful for understanding the effect of different factors on a datapoint in question.

The typical syntax for a Violinplot is `sns.violinplot(data=❓, x=❓, y=❓)`. For `data` we suggest using the `cheaper_apps_df` DataFrame for most of your analysis, so that we have a more fairly distributed graph :)

Let's begin the first challenge! 🚀

1️⃣ Let's start with our `apps_df`, where we have the free apps. Make a `violinplot` to see how ratings are distributed among free versus paid apps.

<br>

<details>
    <summary>Not sure how? Click to see solution 🙈</summary>

<pre>
sns.violinplot(data=apps_df, x='Type', y='Rating')
</pre>
</details>

In [None]:
# your code here

❓ What does the result tell you about app consumers' attention to free/paid apps?

-------

2️⃣ Now let's make a `violinplot` of how the prices are distributed among apps from different content ratings! For this one, please use the `cheaper_apps_df` so that we avoid the free apps, which will skew our findings.

<br>

<details>
    <summary>Not sure how? Click to see solution 🙈</summary>

<pre>
sns.violinplot(data=cheaper_apps_df, x='Content Rating', y='Price')
</pre>
<p>Notice how the average price grows as the age category grows. Makes sense right? :)</p>
</details>

In [None]:
# remember, you can adjust the figure size before every plot for readability
plt.figure(figsize=(15, 6))

# your code here

-----

#### We can add some more variety using a [Seaborn Catplot](https://seaborn.pydata.org/generated/seaborn.catplot.html) with different values for the `kind` property - `violin`, `box`, `boxen`, `bar`. Try to run the cell below!

In [None]:
sns.catplot(data=apps_df, x='Content Rating', y='Rating', kind='boxen')

Because `catplot` is a figure-level `seaborn` function that creates its own figure, it doesn't change size with the `plt.figure` code we were using before. To make it more readable, we instead add `aspect` as an argument to our line of code above.

For example, you can add `aspect=15/6` to your list of `catplot` arguments.

<br>

<details>
    <summary>Not sure how? Click to see solution 🙈</summary>

<pre>
sns.catplot(data=apps_df, x='Content Rating', y='Rating', kind='boxen', aspect=15/6)
</pre>
</details>


In [None]:
# your code here

Go ahead and try out the different kinds of `catplot` apart from `boxen`! And if you are looking for an extra challenge, we've got one.

In [None]:
# your code here

-----

🏋️ **Optional: Adding an order**: We have dozens of app categories - too many to put in one plot - so let's analyse how the price is distributed among the ten most popular categories.

**Tip**: we have added orders in previous plots so do not hesitate to check your previous challenges! 🧐

<br>

<details>
    <summary>Not sure how? Click to see solution 🙈</summary>

<pre>
order = cheaper_apps_df['Category'].value_counts().head(10).index
sns.catplot(data=cheaper_apps_df, x='Category', y='Price', order=order, kind='boxen', aspect=18/6)
</pre>
<p>Make sure to think through each column. What does that data tell you?</p>
</details>


In [None]:
# your code here

-----

### 🤓 Going further? Let's do some Correlation Analysis

We don't really have a way to put Google Play data on a map like we did with AirBnB listings, but we still want you to explore some advanced visualizations we can do!

----

⚖️ Let's jump into **[correlation](https://www.datasciencecentral.com/profiles/blogs/difference-between-correlation-and-regression-in-statistics)**, which is how we can check the relationship between numeric values.

Now let's go ahead and test our first correlation between the price and reviews of apps using [Seaborn Relplot](https://seaborn.pydata.org/generated/seaborn.relplot.html?highlight=relplot#seaborn.relplot). Remember to use `cheaper_apps_df` to get more insight, avoiding the free apps. Do you have any assumptions of what the relation could be? 🤔

In [None]:
with sns.axes_style(style="whitegrid"):
    sns.relplot(x="Price", y="Reviews", data=cheaper_apps_df, aspect=10/5)

You'll notice that we have some outliers when it comes to `Reviews`, which prevent us from gathering insight through our `relplot`. Let's calculate the **mean** of `Reviews` in our DataFrame and change our `cheaper_apps_df` to contain only apps with a reasonable number of reviews.

<br>

<details>
    <summary>Not sure how? Click to see solution 🙈</summary>

<pre>
cheaper_apps_df['Reviews'].mean()
reviews_condition = cheaper_apps_df['Reviews'] < 25000
cheaper_apps_df = cheaper_apps_df[reviews_condition]
</pre>
</details>



In [None]:
# your code here

Now you are ready to try `relplot` again! Rerun the code cell where we made our first `relplot` or **rewrite the code below like a real hacker!**

In [None]:
# your code here

-------

You can start noticing there are some clusters in this Relplot. We have a better way of seeing them using a [`seaborn.jointplot`](https://seaborn.pydata.org/generated/seaborn.jointplot.html).

In [None]:
with sns.axes_style("whitegrid"):
    sns.jointplot(x="Price", y="Reviews", data=cheaper_apps_df, height=7, kind='hex')

------

🔍 The correlation is still not entirely clear though, is it? Let's use [`seaborn.lmplot`](https://seaborn.pydata.org/generated/seaborn.lmplot.html) to graphically read a linear correlation.

The `lmplot` takes exactly the same **attributes** as the `relplot` you did a few lines ago. Give it a go!

<br>

<details>
    <summary>Not sure how? Click to see solution 🙈</summary>

<pre>
sns.lmplot(x="Price", y="Reviews", data=cheaper_apps_df, aspect=12/5)
</pre>
</details>

In [None]:
# your code here

The goal of today is not to do any Machine Learning, but hey! You just did your first **[Regression](https://www.datasciencecentral.com/profiles/blogs/difference-between-correlation-and-regression-in-statistics)** 🔥

Following the graphical linear regression we just drew, we can predict new values based on existing data. 🔮

We can see that, in this data, the more expensive the app, the more reviews it tends to get! Most likely because when you pay for an app, you care enough to come back and give feedback. The blue line shows us that if we price our app at 10 USD we are likely to get around 4000 reviews!
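
`lmplot` draws the fitted line but does not hand back its equation. One way to get a comparable straight-line fit as numbers is `np.polyfit` with degree 1. The sketch below uses made-up points that lie exactly on a line, so the slope, intercept and prediction are illustrative only, not results from the dataset:

```python
import numpy as np

# Toy data lying exactly on reviews = 300 * price + 1000 (hypothetical)
price = np.array([1.0, 2.0, 4.0, 8.0])
reviews = 300 * price + 1000

# Degree-1 polyfit returns the slope and intercept of the least-squares line
slope, intercept = np.polyfit(price, reviews, 1)

# Use the fitted line to predict reviews for a price of 10
predicted = slope * 10 + intercept
print(round(predicted))  # 4000
```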

🏋️ Now your challenge is to **check the correlation between `Reviews` and `Rating`**. Follow the same steps you did for the previous correlation. You've totally got this! :)

In [None]:
# your code here
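
Alongside the plots, a correlation can also be summarised as a single number with `Series.corr`, which computes the Pearson correlation by default. A minimal sketch on made-up values (not results from the dataset):

```python
import pandas as pd

# Toy data (hypothetical values)
toy_df = pd.DataFrame({
    "Reviews": [10, 200, 3000, 40000, 500000],
    "Rating": [3.9, 4.1, 4.0, 4.4, 4.5],
})

# Pearson correlation: +1 is a perfect upward line, -1 a perfect downward line,
# and values near 0 mean little linear relationship
r = toy_df["Reviews"].corr(toy_df["Rating"])
print(round(r, 2))
```

With the real data, the same call would be `cheaper_apps_df['Reviews'].corr(cheaper_apps_df['Rating'])`.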

--------

## 🚀 Congrats on completing the challenges! You rock!

If you want to build on top of what you learned today, check out these resources to continue your Python Data Visualization journey 🙌

* [Different color palettes of Seaborn plots](https://seaborn.pydata.org/tutorial/color_palettes.html)
* [Play around with Seaborn official tutorial data](https://seaborn.pydata.org/tutorial.html)
* [How to customize your plots with custom labels, legends and more](https://hookedondata.org/better-plotting-in-python-with-seaborn/)