In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("tutorial1_3.ipynb")

# Tutorial 1.3: Exploring the Google Play Store with Pandas 

Welcome to Tutorial 1.3! In today's class we covered pandas and how to manipulate data in a DataFrame.

This tutorial is based on an assignment developed by [Jorge Mendez](https://www.seas.upenn.edu/~mendezme/) at UPenn and explores statistics about reviews from Google Play.

In [None]:
# Run this cell, but please don't change it.

# These lines load the tests.
import otter
grader = otter.Notebook()

import io
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

## 1. Reading files
 
We provided two CSV files in the `data/` directory. This tutorial is based on those files.
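
As a minimal sketch of how `pd.read_csv` behaves, the example below reads a small in-memory CSV instead of a file on disk. The CSV content and the name `toy_df` are made up for illustration and are not part of the tutorial data.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a file on disk.
csv_text = """App,Rating
Alpha,4.5
Beta,3.9
"""

# pd.read_csv accepts a file path or any file-like object.
toy_df = pd.read_csv(io.StringIO(csv_text))

print(toy_df.shape)  # (2, 2)
print(toy_df.dtypes)
```

Passing a path string to `pd.read_csv` works the same way; the first row of the file is used as the column labels by default.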

<!--
BEGIN QUESTION
name: q1_1
points: 1
-->

**Question 1.1:** Read in the CSV file of the reviews and store the data as a DataFrame named `reviews_df`.

In [None]:
reviews_df = ...
reviews_df.head(5)

In [None]:
grader.check("q1_1")

<!--
BEGIN QUESTION
name: q1_2
points: 1
-->

**Question 1.2:** Read in the CSV file of the data from the Google Play Store and store the data as a DataFrame named `apps_df`.

In [None]:
apps_df = ...
apps_df.head(5)

In [None]:
grader.check("q1_2")

<!--
BEGIN QUESTION
name: q1_3
points: 1
-->

Remember that in a table, each row represents a new individual item and the columns represent the item's attributes or features.
When working with a table, it is a good idea to get a sense of what the different attributes are.

**Question 1.3:** Extract the names of the columns from `apps_df` as a list or index and assign the answer to the variable `store_columns`.

*Hint:* The DataFrame object has an attribute that will return the column labels of the DataFrame.

In [None]:
store_columns = ...
store_columns

In [None]:
grader.check("q1_3")

<!--
BEGIN QUESTION
name: q1_4
points: 1
-->

**Question 1.4:** Extract the names of the columns from `reviews_df` as a list or index and assign the answer to the variable `review_columns`.

*Hint:* The DataFrame object has an attribute that will return the column labels of the DataFrame.

In [None]:
review_columns = ...
review_columns

In [None]:
grader.check("q1_4")

## 2. Data Filtering

Often an individual in a table is missing a value for one or more attributes. 
Usually, missing values are represented as a NumPy `NaN`. According to the [NumPy documentation](https://numpy.org/doc/stable/reference/constants.html#numpy.nan), `NaN` is 
> A floating point representation of Not a Number

The following line shows how we access the `NaN` value.

In [None]:
np.nan

Since `NaN` is a floating point value, we can compare it to numbers, i.e. integers and floats. Every such comparison evaluates to `False`, as the next cells show.

In [None]:
np.nan > 1

In [None]:
np.nan < -1

In [None]:
np.nan == 0

We check whether the value assigned to a variable is `NaN` by using the NumPy function `np.isnan()`, as shown in the next few cells.

In [None]:
nan_variable = np.nan
one = 1

np.isnan(one), np.isnan(nan_variable)

In [None]:
array_with_one_nan = np.append(np.arange(10), np.nan)

array_with_one_nan, np.isnan(array_with_one_nan)
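
The same idea carries over to pandas. The sketch below, on a made-up Series named `values`, shows why an equality test cannot find missing entries and how a boolean mask from `pd.isna` can be used to keep only the non-missing ones.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing entry.
values = pd.Series([1.0, np.nan, 3.0])

# Equality with NaN is always False, so this mask finds nothing.
eq_mask = values == np.nan

# pd.isna (or np.isnan for numeric arrays) detects missing entries.
na_mask = pd.isna(values)

print(eq_mask.tolist())           # [False, False, False]
print(na_mask.tolist())           # [False, True, False]
print(values[~na_mask].tolist())  # [1.0, 3.0]
```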

<!--
BEGIN QUESTION
name: q2_1
points: 
    - 0.75
    - 0.75
-->


**Question 2.1:** Remove any review from `reviews_df` that is missing either its `Translated_Review` or its `Sentiment`, and store the resulting DataFrame in the same `reviews_df` variable. 
    
*Hint:* The `DataFrame.dropna()` method will be helpful for this. 
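
As a rough illustration of how `dropna` can be limited to specific columns, the sketch below uses a made-up DataFrame `toy` with hypothetical column names (`text`, `label`, `score`). The `subset` parameter restricts the check to the listed columns, so missing values elsewhere do not cause a row to be dropped.

```python
import numpy as np
import pandas as pd

# Hypothetical data; the column names are for illustration only.
toy = pd.DataFrame({
    "text": ["great", np.nan, "bad", "ok"],
    "label": ["pos", "neg", np.nan, "neu"],
    "score": [np.nan, 0.1, 0.2, 0.3],
})

# Drop rows that are missing a value in "text" or "label".
# Row 0 survives even though its "score" is missing.
cleaned = toy.dropna(subset=["text", "label"])
print(cleaned.index.tolist())  # [0, 3]

# Without subset, any missing value drops the row.
print(toy.dropna().index.tolist())  # [3]
```

Note that `dropna` returns a new DataFrame by default, so the result has to be assigned to a variable to be kept.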

In [None]:
reviews_df = ...
reviews_df.shape

In [None]:
grader.check("q2_1")

Often we need to validate our data and remove outliers or invalid values.

The following cell will print out a pandas Series where the index is a rating and the corresponding value is the number of apps that have that rating.

In [None]:
apps_df['Rating'].value_counts()

<!--
BEGIN QUESTION
name: q2_2
points: 1
-->

Looking towards the bottom, we notice that one app has a rating of 19.0. However, ratings can only range between 0 and 5.

**Question 2.2:** Remove any apps from `apps_df` whose `Rating` is invalid (> 5) and store the resulting DataFrame in `apps_df`.
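
Row filtering of this kind is usually done with a boolean mask. The sketch below, on a made-up DataFrame `toy` with hypothetical columns `name` and `score`, shows two masks that differ only in how they treat missing values, since any comparison with `NaN` is `False`. Which behaviour is wanted depends on the task.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one valid score, one out-of-range score, one missing score.
toy = pd.DataFrame({"name": ["a", "b", "c"], "score": [4.2, 19.0, np.nan]})

# Keep rows whose score is at most 5: the NaN row is dropped as well.
keep_valid_only = toy[toy["score"] <= 5]

# Drop rows whose score exceeds 5: the NaN row is kept.
drop_invalid_only = toy[~(toy["score"] > 5)]

print(keep_valid_only["name"].tolist())    # ['a']
print(drop_invalid_only["name"].tolist())  # ['a', 'c']
```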

In [None]:
apps_df = ...

In [None]:
grader.check("q2_2")

<!-- BEGIN QUESTION -->

<!--
BEGIN QUESTION
name: q2_3
points: 3
manual: true
-->


**Question 2.3:** `Translated_Review` contains the text of the reviews. Create a new column in `reviews_df` called `Review_Length` that contains the number of words in each review (for this question, assume that words are separated by whitespace). Then use the DataFrame `describe()` method to print descriptive statistics about the length of the reviews.
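
One possible way to count whitespace-separated words is through the `.str` accessor, sketched here on a made-up DataFrame `toy` with hypothetical columns `text` and `n_words`.

```python
import pandas as pd

# Hypothetical review texts; the third one contains repeated spaces.
toy = pd.DataFrame({"text": ["I love it", "Bad", "Works  fine  most days"]})

# str.split() with no argument splits on runs of whitespace,
# and str.len() then counts the pieces in each resulting list.
toy["n_words"] = toy["text"].str.split().str.len()

print(toy["n_words"].tolist())  # [3, 1, 4]
print(toy["n_words"].describe())
```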

<!-- END QUESTION -->



Run the next cell to plot a histogram of the number of words in each review.

In [None]:
reviews_df['Review_Length'].plot.hist(bins=np.arange(1,75,5))

<!-- BEGIN QUESTION -->

<!--
BEGIN QUESTION
name: q2_4
points: 2
manual: true
-->


**Question 2.4:** Based on the descriptive statistics, do you think the mean is roughly equal to, higher than, or lower than the median? Justify your answer.

_Type your answer here, replacing this text._

<!-- END QUESTION -->



## 3. Visualization

<!-- BEGIN QUESTION -->

<!--
BEGIN QUESTION
name: q3_1
points: 1.5
manual: true
-->


**Question 3.1:** Produce a pie chart of the `Android Ver` requirements for the different apps. Group together all
versions that make up less than 5% of the total apps into a single `Other` category. This should look
similar to ![this](images/pie.png). Don't forget to include a title for the figure.

*Hint 1:* You will find the `value_counts()` method useful for solving this problem.

*Hint 2:* This [Stack Overflow](https://stackoverflow.com/questions/55564896/pandas-python-grouping-counts-to-others) answer will be useful.
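
A minimal sketch of the grouping idea, on made-up labels rather than the tutorial data: compute each label's share, split the shares at a cutoff, and add the small ones back as a single `Other` entry. The names `labels`, `shares`, and `grouped`, and the 10% cutoff, are chosen for illustration only.

```python
import pandas as pd

# Hypothetical labels: "a" is 60%, "b" is 30%, "c" and "d" are 5% each.
labels = pd.Series(["a"] * 12 + ["b"] * 6 + ["c"] + ["d"])

# normalize=True returns fractions of the total instead of raw counts.
shares = labels.value_counts(normalize=True)

threshold = 0.10  # illustrative cutoff, not the one the question asks for
small = shares < threshold

# Keep the large categories and sum the small ones into one entry.
grouped = shares[~small].copy()
grouped["Other"] = shares[small].sum()

print(grouped)
```

A Series like `grouped` can then be drawn with `grouped.plot.pie(title=...)`.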

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

<!--
BEGIN QUESTION
name: q3_2
points: 1.5
manual: true
-->


**Question 3.2:** Create a similar pie chart for app `Category`. In this case, group together categories that make up less
than 3% of the apps. The resulting graph should look something like ![this](images/pie2.png).

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

<!--
BEGIN QUESTION
name: q3_3
points: 2
manual: true
-->


**Question 3.3:** Generate histograms of the `Rating` and `Reviews` columns across all apps, with 20 bins each. The histograms should look like ![this](images/histograms.png).


*Hint:* Remember that histograms are used for numeric data. You might need to convert the values in one of the columns to a numeric type.
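
One way to convert a column of strings to numbers is `pd.to_numeric`. The sketch below uses made-up values in a Series named `raw`; with `errors="coerce"`, entries that cannot be parsed become `NaN` instead of raising an error.

```python
import pandas as pd

# Hypothetical string values; "3.0M" cannot be parsed as a number.
raw = pd.Series(["10", "250", "3.0M", "42"])

numeric = pd.to_numeric(raw, errors="coerce")

print(numeric.tolist())  # [10.0, 250.0, nan, 42.0]
print(numeric.dtype)     # float64
```

Whether coercing unparseable entries to `NaN` is appropriate depends on the data; it is worth inspecting which rows were affected before plotting.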

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

<!--
BEGIN QUESTION
name: q3_4
points: 1.5
manual: true
-->


**Question 3.4:** Plot a bar chart with the number of reviews that received each `Sentiment` value. The sentiments chart should look similar to ![this](images/bar.png).

<!-- END QUESTION -->



## 4. Combining Dataframes

<!--
BEGIN QUESTION
name: q4.1
points: 1
-->


**Question 4.1:** Combine the two DataFrames into a single one, based on the App names, and store the resulting dataframe in a variable called `merged_df`. You should make sure that
all apps from the apps DataFrame are kept, and no app beyond those is added. 

*Hint:* The `pd.merge` function will be useful.
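
As a sketch of how the `how` parameter of `pd.merge` controls which keys survive, the example below joins two made-up tables on a hypothetical `key` column. With `how="left"`, every row of the left table is kept (repeated once per match), and keys that appear only in the right table are left out.

```python
import pandas as pd

# Hypothetical tables; "a" has two matches on the right, "d" exists only on the right.
left = pd.DataFrame({"key": ["a", "b", "c"], "rating": [4.1, 3.5, 4.8]})
right = pd.DataFrame({"key": ["a", "a", "d"], "review": ["good", "fine", "meh"]})

merged = pd.merge(left, right, on="key", how="left")

print(merged["key"].tolist())  # ['a', 'a', 'b', 'c']
print(merged.shape)            # (4, 3)
```

Rows of the left table without a match (here `b` and `c`) get `NaN` in the columns that come from the right table.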

In [None]:
merged_df = ...

print(f"merged_df has {merged_df.shape[0]} rows and {merged_df.shape[1]} columns")
merged_df.head(5)

In [None]:
grader.check("q4.1")

<!-- BEGIN QUESTION -->

<!--
BEGIN QUESTION
name: q4.2
points: 4
manual: true
-->


**Question 4.2:** Group the Sentiment by rounded Rating, and produce a bar chart where you display the different
sentiments grouped by rating. The chart should look like ![this](images/combined_bar.png)

*Hint:* You might find the `np.round` function and the `DataFrame.groupby` and `unstack` methods
helpful for this task.
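
The sketch below shows how these pieces can fit together on a made-up DataFrame `toy` with hypothetical columns `score` and `label`: round the numeric column, count rows per (rounded score, label) pair, and pivot the labels into columns with `unstack`.

```python
import numpy as np
import pandas as pd

# Hypothetical data; column names are for illustration only.
toy = pd.DataFrame({
    "score": [4.4, 4.6, 3.6, 4.2, 3.8],
    "label": ["pos", "pos", "neg", "neg", "pos"],
})
toy["score_rounded"] = np.round(toy["score"])

# size() counts rows per group; unstack() moves the inner index level
# ("label") into columns, filling missing combinations with 0.
counts = (
    toy.groupby(["score_rounded", "label"])
       .size()
       .unstack(fill_value=0)
)
print(counts)
```

A table shaped like `counts` (one row per rounded score, one column per label) can be passed to `counts.plot.bar()` to draw grouped bars.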

---

To double-check your work, the cell below will rerun all of the autograder tests.

In [None]:
grader.check_all()

## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

In [None]:
# Save your notebook first, then run this cell to export your submission.
grader.export()