<a href="https://colab.research.google.com/github/AnamHJ24/datascience-python-challenges/blob/main/notebooks/Day_10.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Day 10 - Apple
You are a Product Analyst for the **Apple** App Store team investigating app ratings data. Your
focus is to clean and understand rating distributions across different app categories. The team
wants to leverage basic statistical insights to guide app performance strategies.


In [10]:
# Import required libraries
import pandas as pd
import numpy as np

# import data file
url = "https://raw.githubusercontent.com/AnamHJ24/datascience-python-challenges/refs/heads/main/Data/Day_10.txt"
app_ratings = pd.read_csv(url)
app_ratings.head(10)

   app_id rating          category review_date
0  app001    4.5             Games  2024-07-05
1  app002    3.9      Productivity  2024-07-06
2  app001    4.7             Games  2024-07-10
3  app003   4.0   Health & Fitness  2024-08-15
4  app004   five         Education  2024-09-01
5  app005    NaN             Games  2024-10-11
6  app006    4.2         Lifestyle  2024-10-20
7  app007      4         Utilities  2024-11-15
8  app008    3.5     Entertainment  2024-12-01
9  app009    4.9  Health & Fitness  2024-12-15


## Question 1
The 'rating' column contains several data inconsistencies: leading or trailing whitespace,
decimals written with commas instead of decimal points (e.g. 4,2 instead of 4.2), and non-numeric
values. Clean up these issues and convert the column to a numeric data type.

## Solution

In [13]:
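# Clean the 'rating' column and convert it to numeric
# Remove leading and trailing whitespace
app_ratings['rating_clean'] = app_ratings['rating'].str.strip()
# Remove any spaces left inside the value
app_ratings['rating_clean'] = app_ratings['rating_clean'].str.replace(' ', '', regex=False)
# Replace decimal commas with decimal points (e.g. 4,2 -> 4.2)
app_ratings['rating_clean'] = app_ratings['rating_clean'].str.replace(',', '.', regex=False)
# Convert to numeric; values that cannot be parsed (e.g. 'five') become NaN
app_ratings['rating_clean'] = pd.to_numeric(app_ratings['rating_clean'], errors="coerce")
print(app_ratings.head(10))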

   app_id rating          category review_date  rating_clean
0  app001    4.5             Games  2024-07-05           4.5
1  app002    3.9      Productivity  2024-07-06           3.9
2  app001    4.7             Games  2024-07-10           4.7
3  app003   4.0   Health & Fitness  2024-08-15           4.0
4  app004   five         Education  2024-09-01           NaN
5  app005    NaN             Games  2024-10-11           NaN
6  app006    4.2         Lifestyle  2024-10-20           4.2
7  app007      4         Utilities  2024-11-15           4.0
8  app008    3.5     Entertainment  2024-12-01           3.5
9  app009    4.9  Health & Fitness  2024-12-15           4.9
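
The same cleaning steps can be wrapped in a small reusable function. The sketch below is illustrative only: the `sample` DataFrame and the `clean_rating` helper are hypothetical and not part of the challenge data, and the sample values are made up to cover each issue named in Question 1. It also shows one way to list the values that were present but could not be parsed, which is useful for checking how much data `errors="coerce"` discards.

```python
import pandas as pd

# Hypothetical sample covering the issues from Question 1:
# surrounding whitespace, a decimal comma, a non-numeric value and a missing value.
sample = pd.DataFrame({"rating": ["4.5", " 3.9 ", "4,2", "five", None, "4"]})


def clean_rating(raw: pd.Series) -> pd.Series:
    """Strip whitespace, swap decimal commas for points, coerce the rest to NaN."""
    cleaned = raw.str.strip().str.replace(",", ".", regex=False)
    return pd.to_numeric(cleaned, errors="coerce")


sample["rating_clean"] = clean_rating(sample["rating"])

# Rows where a raw value existed but could not be converted to a number
unparsed = sample[sample["rating"].notna() & sample["rating_clean"].isna()]

print(sample)
print("\nValues that could not be parsed:")
print(unparsed)
```

Keeping the original `rating` column next to `rating_clean`, as the solution above does, makes this kind of audit possible.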


## Question 2
Using the cleaned dataset, display the first and last five entries to get an overview of the app ratings
across different categories.

## Solution

In [14]:
# First five entries
first_five = app_ratings.head()

# Last five entries
last_five = app_ratings.tail()

# Display result
print("First and last five entries:\n")
print(pd.concat([first_five, last_five]))

First and last five entries:

    app_id rating          category review_date  rating_clean
0   app001    4.5             Games  2024-07-05           4.5
1   app002    3.9      Productivity  2024-07-06           3.9
2   app001    4.7             Games  2024-07-10           4.7
3   app003   4.0   Health & Fitness  2024-08-15           4.0
4   app004   five         Education  2024-09-01           NaN
45  app006    4.0         Lifestyle  2024-07-15           4.0
46  app007    NaN         Utilities  2024-07-16           NaN
47  app008    4.6     Entertainment  2024-07-17           4.6
48  app009    3.9  Health & Fitness  2024-07-18           3.9
49  app010    4.1             Games  2024-07-19           4.1
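
If this head-and-tail preview is needed repeatedly, it could be wrapped in a helper. The following is a minimal sketch, assuming a hypothetical function name `preview` and a made-up `demo` DataFrame; it returns the whole frame when it is too short for the head and tail to be distinct, so no row appears twice.

```python
import pandas as pd


def preview(df: pd.DataFrame, n: int = 5) -> pd.DataFrame:
    """Return the first and last n rows, or the whole frame if it has at most 2n rows."""
    if len(df) <= 2 * n:
        return df
    return pd.concat([df.head(n), df.tail(n)])


# Hypothetical data: 20 apps with a constant rating, only to show which rows are kept
demo = pd.DataFrame({
    "app_id": [f"app{i:03d}" for i in range(1, 21)],
    "rating_clean": [4.0] * 20,
})

print(preview(demo))
```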


## Question 3
Calculate the basic summary statistics (mean, median, standard deviation) of app ratings for each
category to identify variations and performance patterns.

## Solution

In [21]:
# Calculate summary statistics
mean_ratings = app_ratings.groupby('category')['rating_clean'].mean()
print('\nMEAN:',round(mean_ratings, 2))
median_ratings = app_ratings.groupby('category')['rating_clean'].median()
print('\nMEDIAN:',round(median_ratings, 2))
std_ratings = app_ratings.groupby('category')['rating_clean'].std()
print('\nSTANDARD DEVIATION:',round(std_ratings, 2))


MEAN: category
Education           4.38
Entertainment       4.22
Games               4.19
Health & Fitness    4.24
Lifestyle           4.03
Productivity        3.78
Utilities           3.72
Name: rating_clean, dtype: float64

MEDIAN: category
Education           4.30
Entertainment       4.20
Games               4.35
Health & Fitness    4.15
Lifestyle           4.05
Productivity        3.80
Utilities           3.65
Name: rating_clean, dtype: float64

STANDARD DEVIATION: category
Education           0.38
Entertainment       0.51
Games               0.88
Health & Fitness    0.38
Lifestyle           0.17
Productivity        0.19
Utilities           0.72
Name: rating_clean, dtype: float64
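
As an alternative to three separate `groupby` calls, the statistics can be collected into one table with named aggregation. The sketch below uses a small made-up `demo` DataFrame rather than the challenge data, and the output column names (`mean`, `median`, `std`, `n_ratings`) are arbitrary choices. The extra count column is an addition that shows how many non-missing ratings each statistic is based on.

```python
import pandas as pd

# Hypothetical ratings, including one missing value
demo = pd.DataFrame({
    "category": ["Games", "Games", "Games", "Utilities", "Utilities"],
    "rating_clean": [4.0, 5.0, None, 3.0, 4.0],
})

# One pass over the groups; NaN ratings are skipped by each aggregation
summary = (
    demo.groupby("category")["rating_clean"]
    .agg(mean="mean", median="median", std="std", n_ratings="count")
    .round(2)
)

print(summary)
```

Note that pandas computes the sample standard deviation (`ddof=1`) by default, so a category with a single rating has a standard deviation of NaN.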
