# Springboard Apps project - Tier 3 - Complete

Welcome to the Apps project! To give you a taste of your future career, we're going to walk through exactly the kind of notebook that you'd write as a data scientist. In the process, we'll be sure to signpost the general framework for our investigation - the Data Science Pipeline - as well as give reasons for why we're doing what we're doing. We're also going to apply some of the skills and knowledge you've built up in the previous unit when reading Professor Spiegelhalter's *The Art of Statistics* (hereinafter *AoS*). 

So let's get cracking!

**Brief**

Did Apple Store apps receive better reviews than Google Play apps?

## Stages of the project

1. Sourcing and loading 
    * Load the two datasets
    * Pick the columns that we are going to work with 
    * Subsetting the data on this basis 
 
 
2. Cleaning, transforming and visualizing
    * Check the data types and fix them
    * Add a `platform` column to both the `Apple` and the `Google` dataframes
    * Changing the column names to prepare for a join 
    * Join the two data sets
    * Eliminate the `NaN` values
    * Filter only those apps that have been reviewed at least once
    * Summarize the data visually and analytically (by the column `platform`)  
  
  
3. Modelling 
    * Hypothesis formulation
    * Getting the distribution of the data
    * Permutation test 


4. Evaluating and concluding 
    * What is our conclusion?
    * What is our decision?
    * Other models we could have used. 
    

## Importing the libraries

In this case we are going to import pandas, numpy, scipy, random and matplotlib.pyplot.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# scipy.stats provides statistical tests and probability distributions
from scipy import stats
# random enables us to generate random numbers
import random

## Stage 1 -  Sourcing and loading data

### 1a. Source and load the data
Let's download the data from Kaggle. Kaggle is a fantastic resource: a kind of social medium for data scientists, it boasts projects, datasets and news on the freshest libraries and technologies all in one place. The data from the Apple Store can be found [here](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps) and the data from the Google Play Store can be found [here](https://www.kaggle.com/lava18/google-play-store-apps).
Download the datasets and save them in your working directory.

In [None]:
# Assuming the Google dataset is named 'googleplaystore.csv' and is located in the same folder as this notebook
google = 'googleplaystore.csv'

# Read the csv file into a DataFrame called Google
Google = pd.read_csv(google)

# Using the head() method to observe the first three entries of the Google dataset
print(Google.head(3))

In [None]:
# Assuming the Apple dataset is named 'applestore.csv' and is located in the same folder as this notebook
apple = 'applestore.csv'

# Read the csv file into a DataFrame called Apple
Apple = pd.read_csv(apple)

# Using the head() method to observe the first three entries of the Apple dataset
print(Apple.head(3))

### 1b. Pick the columns we'll work with

From the documentation of these datasets, we can infer that the most appropriate columns to answer the brief are:

1. Google:
    * `Category` # Do we need this?
    * `Rating`
    * `Reviews`
    * `Price` (maybe)
2. Apple:    
    * `prime_genre` # Do we need this?
    * `user_rating` 
    * `rating_count_tot`
    * `price` (maybe)

### 1c. Subsetting accordingly

Let's select only those columns that we want to work with from both datasets. We'll overwrite the subsets in the original variables.

In [None]:
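# Subset the Google DataFrame, overwriting the original variable
# (.copy() gives an independent DataFrame, so later column assignments are safe)
Google = Google[['Category', 'Rating', 'Reviews', 'Price']].copy()

# Check the first three entries of the subset
print(Google.head(3))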

In [None]:
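# Subset the Apple DataFrame, overwriting the original variable
# (.copy() gives an independent DataFrame, so later column assignments are safe)
Apple = Apple[['prime_genre', 'user_rating', 'rating_count_tot', 'price']].copy()

# Check the first three entries of the subset
print(Apple.head(3))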

## Stage 2 -  Cleaning, transforming and visualizing

### 2a. Check the data types for both Apple and Google, and fix them

Types are crucial for data science in Python. Let's determine whether the variables we selected in the previous section have the types they should, or whether there are any errors here.

In [None]:
# Check the data types of the Apple DataFrame
print(Apple.dtypes)

This is looking healthy. But what about our Google data frame?

In [None]:
# Check the data types of the Google DataFrame
print(Google.dtypes)

Weird. The data type for the column 'Price' is 'object', not a numeric data type like a float or an integer. Let's investigate the unique values of this column. 
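To see why this can happen, here is a minimal sketch on a made-up column (the `toy` Series and its values are purely illustrative, not taken from the dataset). A single non-numeric entry is enough to stop pandas from storing a column as numbers, and `pd.to_numeric` with `errors='coerce'` is one way to locate the entries responsible.

```python
import pandas as pd

# Hypothetical price-like column with one stray, non-numeric entry
toy = pd.Series(['0', '4.99', 'unknown', '1.50'])
print(toy.dtype)

# errors='coerce' turns anything unparseable into NaN instead of raising
coerced = pd.to_numeric(toy, errors='coerce')

# The NaN positions point at the values that blocked a numeric dtype
bad_values = toy[coerced.isna()].tolist()
print(bad_values)
```

Inspecting the unique values, as we do next, is the more direct route when the column has only a modest number of distinct entries.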

In [None]:
# Check unique values in the Price column of the Google DataFrame
unique_google_prices = Google['Price'].unique()
print(unique_google_prices)

# Check unique values in the price column of the Apple DataFrame
# (the Apple column is named 'price', in lower case)
unique_apple_prices = Apple['price'].unique()
print(unique_apple_prices)

Aha! Fascinating. There are actually two issues here. 

- Firstly, there's a price called `Everyone`. That is a massive mistake! 
- Secondly, there are dollar symbols everywhere! 


Let's address the first issue first. Let's check the datapoints that have the price value `Everyone`

In [None]:
# Let's check which data points have the value 'Everyone' for the 'Price' column by subsetting our Google dataframe.

# Subset the Google DataFrame for rows where the Price column is 'Everyone'
google_everyone_price = Google[Google['Price'] == 'Everyone']
print(google_everyone_price)

Thankfully, it's just one row. We've gotta get rid of it. 

In [None]:
# Let's eliminate that row.

# Keep only the rows whose 'Price' value is NOT 'Everyone', and reassign the result to Google
Google = Google[Google['Price'] != 'Everyone'].copy()

# Check the unique values in the Price column again
print(Google['Price'].unique())


Our second problem remains: I'm seeing dollar symbols when I close my eyes! (And not in a good way). 

This is a problem because Python actually considers these values strings. So we can't do mathematical and statistical operations on them until we've made them into numbers. 
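As a small illustration of the problem, the sketch below uses three made-up prices (the `prices` Series is hypothetical). "Adding" two of the strings simply glues them together, whereas stripping the dollar sign and converting gives values we can actually sum.

```python
import pandas as pd

# Hypothetical prices in the same '$x.xx' string format
prices = pd.Series(['0', '$4.99', '$1.50'])

# With strings, + means concatenation, not addition
concatenated = prices.iloc[1] + prices.iloc[2]
print(concatenated)

# regex=False treats '$' as a literal character rather than
# the regular-expression "end of string" anchor
cleaned = prices.str.replace('$', '', regex=False)
numeric = pd.to_numeric(cleaned)

# Now arithmetic behaves as expected
print(numeric.sum())
```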

In [None]:
# Create a variable called nosymb to store the Price column of Google without dollar signs
# regex=False makes '$' a literal character rather than the regex end-of-string anchor
nosymb = Google['Price'].str.replace('$', '', regex=False)

# Convert the values in nosymb to numeric and reassign them to the Price column
Google['Price'] = pd.to_numeric(nosymb)


Now let's check the data types for our Google dataframe again, to verify that the 'Price' column really is numeric now.

In [None]:
# Check the data types of the Google DataFrame
print(Google.dtypes)

Notice that the column `Reviews` is still an object column. We actually need this column to be a numeric column, too. 

In [None]:
# Convert the 'Reviews' column to numeric
Google['Reviews'] = pd.to_numeric(Google['Reviews'])

In [None]:
# Check the data types of the Google DataFrame
print(Google.dtypes)

### 2b. Add a `platform` column to both the `Apple` and the `Google` dataframes
Let's add a new column to both dataframe objects called `platform`: all of its values in the Google dataframe will be just 'google', and all of its values for the Apple dataframe will be just 'apple'. 

The reason we're making this column is so that we can ultimately join our Apple and Google data together, and actually test out some hypotheses to solve the problem in our brief. 

In [None]:
# Add 'platform' column to Apple DataFrame with all values set to 'apple'
Apple['platform'] = 'apple'

# Add 'platform' column to Google DataFrame with all values set to 'google'
Google['platform'] = 'google'

# These lines create a new column named 'platform' in both DataFrames, 
# and fill this column with the string 'apple' for all rows in the Apple DataFrame, 
# and 'google' for all rows in the Google DataFrame, accordingly.


### 2c. Changing the column names to prepare for our join of the two datasets 
The easiest way to join two datasets is when they have both:
- the same number of columns
- the same column names

So we need to rename the columns of `Apple` so that they're the same as the ones of `Google`, or vice versa.

In this case, we're going to change the `Apple` column names to the names of the `Google` columns.

This is an important step to unify the two datasets!
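To make the reason concrete, here is a minimal sketch with two tiny, made-up frames (`left` and `right` are hypothetical stand-ins, not our real data). Stacking frames whose columns are named differently produces the union of all columns padded with `NaN`, whereas renaming first lets the rows line up.

```python
import pandas as pd

# Two hypothetical frames holding the same information under different names
left = pd.DataFrame({'Rating': [4.5, 3.0], 'Reviews': [10, 2]})
right = pd.DataFrame({'user_rating': [4.0], 'rating_count_tot': [7]})

# Without renaming, stacking gives four columns, padded with NaN
unaligned = pd.concat([left, right], ignore_index=True)

# After renaming, the rows stack cleanly under shared column names
right_renamed = right.rename(columns={'user_rating': 'Rating',
                                      'rating_count_tot': 'Reviews'})
aligned = pd.concat([left, right_renamed], ignore_index=True)

print(unaligned.shape, aligned.shape)
print(unaligned.isna().sum().sum(), aligned.isna().sum().sum())
```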

In [None]:
# Map each Apple column name to its Google equivalent
rename_mapping = {
    'prime_genre': 'Category',
    'user_rating': 'Rating',
    'rating_count_tot': 'Reviews',
    'price': 'Price',
}

# Use the rename() method to change the Apple DataFrame column names
Apple = Apple.rename(columns=rename_mapping)

# Confirm that the two DataFrames now share the same column names
print(Apple.columns)
print(Google.columns)


### 2d. Join the two datasets 
Let's combine the two datasets into a single data frame called `combined_df`.

In [None]:
# Stack the Apple DataFrame beneath the Google DataFrame
# (DataFrame.append was removed in pandas 2.0; pd.concat is the replacement)
combined_df = pd.concat([Google, Apple], ignore_index=True)

# View 12 random samples from the combined DataFrame
random_samples = combined_df.sample(12)
print(random_samples)

### 2e. Eliminate the NaN values

As you can see there are some `NaN` values. We want to eliminate all these `NaN` values from the table.
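Before dropping anything, it can be useful to know where the missing values are. The sketch below uses a small, made-up frame (`toy` is hypothetical) to count `NaN` values per column and to show that `dropna()` removes every row containing at least one of them.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with two missing ratings
toy = pd.DataFrame({'Rating': [4.5, np.nan, 3.0, np.nan],
                    'Reviews': [10, 0, 5, 3]})

# Count missing values per column before dropping anything
missing_per_column = toy.isna().sum()
print(missing_per_column)

# dropna() removes every row that contains at least one NaN
toy_clean = toy.dropna()
print(len(toy), '->', len(toy_clean))
```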

In [None]:
# Check the dimensions of the combined DataFrame before dropping NaN values
print(combined_df.shape)

# Drop all NaN values and overwrite the combined DataFrame
combined_df.dropna(inplace=True)

# Check the new dimensions of the combined DataFrame after dropping NaN values
print(combined_df.shape)

### 2f. Filter the data so that we only see those apps that have been reviewed at least once

Apps that haven't been reviewed yet can't help us solve our brief. 

So let's check to see if any apps have no reviews at all. 

In [None]:
# Subset the DataFrame for rows where 'Reviews' is 0
reviews_zero_df = combined_df[combined_df['Reviews'] == 0]

# Count the number of rows (apps) in this subset
count_reviews_zero = len(reviews_zero_df)

print(count_reviews_zero)

929 apps do not have reviews, so we need to eliminate these data points!

In [None]:
# Eliminate rows where 'Reviews' is 0
combined_df = combined_df[combined_df['Reviews'] != 0]

# This code filters out any rows where the 'Reviews' column has a value of 0, 
# keeping only those rows with at least one review, and then overwrites the original DataFrame with this filtered data.


### 2g. Summarize the data visually and analytically (by the column `platform`)

What we need to solve our brief is a summary of the `Rating` column, but separated by the different platforms.

In [None]:
# Group the DataFrame by 'platform' and calculate summary statistics
grouped_summary = combined_df.groupby('platform').agg({
    'Rating': ['mean', 'median', 'std'],  # Mean, median, and standard deviation for Rating
    'Reviews': ['mean', 'median', 'count'],  # Mean, median, and count for Reviews
    'Price': ['mean', 'median', 'std']  # Mean, median, and standard deviation for Price
})

print(grouped_summary)


Interesting! Our means of 4.049697 and 4.191757 don't **seem** all that different! Perhaps we've solved our brief already: there's no significant difference between Google Play app reviews and Apple Store app reviews. Still, we do have an ***observed difference*** here, which is simply (4.191757 - 4.049697) = 0.14206. This is just the actual difference that we observed between the mean rating for apps from Google Play and the mean rating for apps from the Apple Store. Let's look at how we're going to use this observed difference to solve our problem using a statistical test.

**Outline of our method:**
1. We'll assume that platform (i.e., whether the app was Google or Apple) really doesn’t impact on ratings.


2. Given this assumption, we should actually be able to get a difference in mean rating for Apple apps and mean rating for Google apps that's pretty similar to the one we actually got (0.14206) just by:
    * shuffling the ratings column,
    * keeping the platform column the same,
    * calculating the difference between the mean rating for Apple and the mean rating for Google.


3. We can make the shuffle more useful by doing it many times, each time calculating the mean rating for Apple apps and the mean rating for Google apps, and the difference between these means. 


4. We can then take the mean of all these differences, and this will be called our permutation difference. This permutation difference will be a great indicator of what the difference would be if our initial assumption were true and platform really doesn’t impact on ratings.


5. Now we do a comparison. If the observed difference looks just like the permutation difference, then we stick with the claim that actually, platform doesn’t impact on ratings. If instead, however, the permutation difference differs significantly from the observed difference, we'll conclude: something's going on; the platform does in fact impact on ratings. 


6. As for what the definition of *significantly* is, we'll get to that. But there’s a brief summary of what we're going to do. Exciting!
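The outline above can be sketched on synthetic data before we apply it to the real ratings. Everything in this block is an assumption made for illustration: the two groups, their means and spreads, and the choice of 2,000 shuffles. It is a rough sketch of the idea, not the analysis itself.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded so the sketch is reproducible

# Synthetic ratings for two hypothetical groups with different means
group_a = rng.normal(loc=4.0, scale=0.5, size=200)
group_b = rng.normal(loc=4.3, scale=0.5, size=200)

ratings = np.concatenate([group_a, group_b])
is_a = np.arange(ratings.size) < group_a.size  # fixed group labels

observed = abs(group_a.mean() - group_b.mean())

# Shuffle the ratings many times while the labels stay fixed
n_permutations = 2000
perm_diffs = np.empty(n_permutations)
for i in range(n_permutations):
    shuffled = rng.permutation(ratings)
    perm_diffs[i] = shuffled[is_a].mean() - shuffled[~is_a].mean()

# Proportion of shuffled differences at least as extreme as the observed one
p_value = np.mean(np.abs(perm_diffs) >= observed)
print(observed, perm_diffs.mean(), p_value)
```

Because shuffling breaks any link between group and rating, the shuffled differences cluster around zero. The further the observed difference sits from that cluster, the smaller the proportion printed at the end.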

If you want to look more deeply at the statistics behind this project, check out [this resource](https://www.springboard.com/archeio/download/4ea4d453b0b84014bcef287c50f47f00/).

Let's also get a **visual summary** of the `Rating` column, separated by the different platforms. 

A good tool to use here is the boxplot!

In [None]:
import matplotlib.pyplot as plt

# Create a boxplot of ratings for each platform
combined_df.boxplot(column='Rating', by='platform', figsize=(10, 6))

# Set titles and labels
plt.title('Rating Distribution by Platform')
plt.suptitle('')  # Suppress the automatic title
plt.xlabel('Platform')
plt.ylabel('Rating')

# Show the plot
plt.show()

Here we see the same information as in the analytical summary, but with a boxplot. Can you see how the boxplot is working here? If you need to revise your boxplots, check out this [link](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps).
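As a reminder of what the boxplot draws, here is a minimal sketch that computes the same quantities by hand for a handful of made-up ratings (the `ratings` array is hypothetical). It assumes matplotlib's default whisker rule of 1.5 times the interquartile range.

```python
import numpy as np

# Hypothetical ratings, including one unusually low value
ratings = np.array([1.0, 3.5, 4.0, 4.0, 4.5, 4.5, 4.5, 5.0])

# The box spans the first to the third quartile, with a line at the median
q1, median, q3 = np.percentile(ratings, [25, 50, 75])
iqr = q3 - q1

# Whiskers reach the furthest data points within 1.5 * IQR of the box;
# anything beyond those fences is drawn as an individual outlier point
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = ratings[(ratings < lower_fence) | (ratings > upper_fence)]

print(q1, median, q3)
print(outliers)
```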

## Stage 3 - Modelling

### 3a. Hypothesis formulation

Our **Null hypothesis** is just:

**H<sub>null</sub>**: the observed difference in the mean rating of Apple Store and Google Play apps is due to chance (and thus not due to the platform).

The more interesting hypothesis is called the **Alternative hypothesis**:

**H<sub>alternative</sub>**: the observed difference in the mean rating of Apple Store and Google Play apps is not due to chance (and is actually due to platform).

We're also going to pick a **significance level** of 0.05. 

### 3b. Getting the distribution of the data
Now that the hypotheses and significance level are defined, we can select a statistical test to determine which hypothesis to accept. 

There are many different statistical tests, all with different assumptions. You'll develop excellent judgement about when to use which statistical tests over the Data Science Career Track course. But in general, one of the most important things to determine is the **distribution of the data**.

In [None]:
# Create a subset of the 'Rating' column for Apple apps
apple_ratings = combined_df[combined_df['platform'] == 'apple']['Rating']

# Create a subset of the 'Rating' column for Google apps
google_ratings = combined_df[combined_df['platform'] == 'google']['Rating']

In [None]:
# Use stats.normaltest() to test whether the Apple ratings are normally distributed
# Save the result in a variable called apple_normal
apple_normal = stats.normaltest(apple_ratings)

# Print the result
print(apple_normal)


In [None]:
# Test if google ratings data are normally distributed
google_normal = stats.normaltest(google_ratings)

# Print the result
print(google_normal)


Since the null hypothesis of `normaltest()` is that the data are normally distributed, the lower the p-value in the result of this test, the stronger the evidence that the data are not normal.

Since the p-values are effectively 0 for both tests, whatever we pick for the significance level, our conclusion is that the data are not normally distributed.
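To get a feel for how `stats.normaltest()` behaves, the sketch below runs it on two synthetic samples: one drawn from a normal distribution and one that is clearly skewed. The samples, sizes and seed are assumptions chosen for illustration only.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Synthetic samples: one normal, one clearly skewed (exponential)
normal_sample = rng.normal(loc=0.0, scale=1.0, size=1000)
skewed_sample = rng.exponential(scale=1.0, size=1000)

normal_result = stats.normaltest(normal_sample)
skewed_result = stats.normaltest(skewed_sample)

# A tiny p-value is evidence against normality
print('normal sample p-value:', normal_result.pvalue)
print('skewed sample p-value:', skewed_result.pvalue)
```

Bear in mind that with samples of several thousand points, even mild departures from normality tend to produce very small p-values, which is one reason to look at a histogram as well.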

We can actually also check out the distribution of the data visually with a histogram. A normal distribution has the following visual characteristics:
- symmetric
- unimodal (one hump)

as well as a roughly identical mean, median and mode.

In [None]:
# Create a histogram of the Apple ratings distribution
plt.hist(apple_ratings, bins=20, alpha=0.7, label='Apple Ratings')

# Set the title and labels
plt.title('Distribution of Apple App Ratings')
plt.xlabel('Rating')
plt.ylabel('Frequency')

# Show the legend
plt.legend()

# Show the plot
plt.show()


In [None]:
# Create a histogram of the Google ratings distribution
plt.hist(google_ratings, bins=20, alpha=0.7, label='Google Ratings', color='green')

# Set the title and labels
plt.title('Distribution of Google App Ratings')
plt.xlabel('Rating')
plt.ylabel('Frequency')

# Show the legend
plt.legend()

# Show the plot
plt.show()

### 3c. Permutation test
Since the data aren't normally distributed, we're using a *non-parametric* test here. This is simply a label for statistical tests that don't assume the data follow a particular distribution, such as the normal. These tests are widely applicable because of how few assumptions we need to make.

Check out more about permutations [here.](http://rasbt.github.io/mlxtend/user_guide/evaluate/permutation_test/)
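For comparison with the hand-rolled loop we build below, recent SciPy releases also ship a ready-made `stats.permutation_test` function. The sketch here applies it to synthetic data; the two samples, the `mean_difference` helper and the number of resamples are all assumptions for illustration, and the function may not be available in older SciPy installations.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Synthetic ratings for two hypothetical platforms
sample_a = rng.normal(loc=4.0, scale=0.5, size=150)
sample_b = rng.normal(loc=4.3, scale=0.5, size=150)

def mean_difference(x, y):
    # Test statistic: difference between the two group means
    return np.mean(x) - np.mean(y)

result = stats.permutation_test(
    (sample_a, sample_b),
    mean_difference,
    permutation_type='independent',  # shuffle observations between groups
    vectorized=False,
    n_resamples=999,
    alternative='two-sided',
)
print(result.statistic, result.pvalue)
```

Writing the loop ourselves, as we do next, keeps every step visible, which is the point of this exercise.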

In [None]:
import numpy as np

# Create a column 'Permutation1' and assign shuffled values of the 'Rating' column
combined_df['Permutation1'] = np.random.permutation(combined_df['Rating'])

# Group by 'platform' and call describe() on 'Permutation1'
permuted_grouped_desc = combined_df.groupby('platform')['Permutation1'].describe()

# Print the descriptive statistics of the permutation grouped by 'platform'
print(permuted_grouped_desc)

In [None]:
# Print the analytical summary obtained from the permutation test
print("Analytical Summary from Permutation Test:")
print(permuted_grouped_desc)

# Print the previous analytical summary
print("\nPrevious Analytical Summary:")
print(grouped_summary)


In [None]:
# First, make a list called difference
difference = []

# Boolean masks for each platform, computed once as NumPy arrays
apple_mask = (combined_df['platform'] == 'apple').to_numpy()
google_mask = (combined_df['platform'] == 'google').to_numpy()

# Now make a for loop that does the following 10,000 times
for _ in range(10000):
    # 1. Make a permutation of the 'Rating' column
    permuted_ratings = np.random.permutation(combined_df['Rating'].to_numpy())

    # 2. Calculate the difference in the mean rating for Apple and the mean rating for Google
    apple_mean = permuted_ratings[apple_mask].mean()
    google_mean = permuted_ratings[google_mask].mean()
    diff = apple_mean - google_mean

    # 3. Append the difference to the list
    difference.append(diff)


In [None]:
# Make a variable called 'histo' and assign to it the result of plotting a histogram of the difference list
histo = plt.hist(difference, bins=30, alpha=0.7, color='blue')

# Set title and labels
plt.title('Distribution of Differences in Mean Ratings')
plt.xlabel('Difference in Mean Ratings (Apple - Google)')
plt.ylabel('Frequency')

# Show the plot
plt.show()


In [None]:
# Calculate the observed difference between the mean ratings of Apple and Google apps
obs_difference = abs(apple_ratings.mean() - google_ratings.mean())

# Print out the observed difference
print(obs_difference)


## Stage 4 -  Evaluating and concluding
### 4a. What is our conclusion?

In [None]:
'''
What do we know? 

Recall: The p-value of our observed data is just the proportion of the data given the null that's at least as extreme as that observed data.

As a result, we're going to count how many of the differences in our difference list are at least as extreme as our observed difference.

If less than or equal to 5% of them are, then we will reject the Null. 
'''
# Calculate the number of differences in the difference list that are at least as extreme as the observed difference
extreme_differences = [diff for diff in difference if abs(diff) >= obs_difference]

# Calculate the proportion of extreme differences
p_value = len(extreme_differences) / len(difference)

# Print the p-value
print("p-value:", p_value)

# Determine whether to reject the null hypothesis
if p_value <= 0.05:
    print("We reject the null hypothesis. Platform significantly impacts ratings.")
else:
    print("We fail to reject the null hypothesis. Platform may not significantly impact ratings.")


### 4b. What is our decision?
So actually, zero differences are at least as extreme as our observed difference!

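So the estimated p-value of our observed data is 0: none of our 10,000 shuffles produced a difference this large.

This is well below our significance level of 0.05 (and any other conventional level), so our observed data is statistically significant, and we reject the Null.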

We conclude that platform does impact on ratings. Specifically, we should advise our client to integrate **only Google Play** into their operating system interface. 
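A note on reporting a p-value of exactly 0: with a finite number of shuffles, 0 really means "smaller than we can resolve". One commonly used adjustment, sketched below, counts the observed arrangement as one of the permutations so the estimate can never be exactly 0. The simulated `perm_diffs` and the `observed` value are stand-ins made up for illustration, not our actual results.

```python
import numpy as np

rng = np.random.default_rng(7)

# Stand-in for 10,000 permuted differences (synthetic, centred on zero)
perm_diffs = rng.normal(loc=0.0, scale=0.01, size=10_000)
observed = 0.14  # hypothetical observed difference, far out in the tail

n_extreme = int(np.sum(np.abs(perm_diffs) >= observed))

# Raw proportion: exactly 0 when no permutation is as extreme
p_raw = n_extreme / perm_diffs.size

# Add-one estimate: treats the observed data as one more permutation
p_adjusted = (n_extreme + 1) / (perm_diffs.size + 1)

print(p_raw, p_adjusted)
```

Either way the decision is the same here; the adjusted figure is simply a more honest way to state how small the p-value is known to be.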

### 4c. Other statistical tests, and next steps
The test we used here is the Permutation test. This was appropriate because our data were not normally distributed! 
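As one example of an alternative, the Mann-Whitney U test is another non-parametric option available in SciPy as `stats.mannwhitneyu`. The sketch below runs it on synthetic ratings rounded to half-star steps; the data, sample sizes and seed are assumptions for illustration. Note that it compares ranks rather than means, so it answers a slightly different question from our permutation test.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Synthetic ratings for two hypothetical platforms, rounded to half stars
raw_a = rng.normal(loc=4.0, scale=0.6, size=300).clip(1, 5)
raw_b = rng.normal(loc=4.3, scale=0.6, size=300).clip(1, 5)
ratings_a = np.round(raw_a * 2) / 2
ratings_b = np.round(raw_b * 2) / 2

# Rank-based test of whether one group tends to have higher values
u_statistic, p_value = stats.mannwhitneyu(ratings_a, ratings_b,
                                          alternative='two-sided')
print(u_statistic, p_value)
```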

As we've seen in Professor Spiegelhalter's book, there are actually many different statistical tests, all with different assumptions. How many of these different statistical tests can you remember? How much do you remember about what the appropriate conditions are under which to use them? 

Make a note of your answers to these questions, and discuss them with your mentor at your next call. 
