Day 10 of Python Summer Party

by Interview Master

Apple

App Store Ratings Performance by App Category

You are a Product Analyst for the Apple App Store team investigating app ratings data. Your focus is to clean and understand rating distributions across different app categories. The team wants to leverage basic statistical insights to guide app performance strategies.

In [1]:
import numpy as np
import pandas as pd


In [2]:
# Load the CSV file into a DataFrame and display it
app_ratings = pd.read_csv('app_ratings.csv')
app_ratings_df = app_ratings.copy()
print(app_ratings_df)
print()
print(app_ratings_df.info())


    app_id         rating          category review_date
0   app001            4.5             Games  2024-07-05
1   app002            3.9      Productivity  2024-07-06
2   app001            4.7             Games  2024-07-10
3   app003           4.0   Health & Fitness  2024-08-15
4   app004           five         Education  2024-09-01
5   app005            NaN             Games  2024-10-11
6   app006            4.2         Lifestyle  2024-10-20
7   app007              4         Utilities  2024-11-15
8   app008            3.5     Entertainment  2024-12-01
9   app009            4.9  Health & Fitness  2024-12-15
10  app010            4,2             Games  2025-01-07
11  app011            3.5      Productivity  2025-01-15
12  app012            4.0         Education  2025-01-20
13  app013            2.1             Games  2025-02-14
14  app014            3.8         Lifestyle  2025-02-20
15  app015            4.5             Games  2025-03-03
16  app016            3.3         Utilities  202

Question 1 of 3

There are some data inconsistencies in the 'rating' column, specifically: leading or trailing whitespace, decimals represented by commas instead of decimal points (e.g., 4,2 instead of 4.2), and non-numeric values. Clean up these data issues and convert the column to a numeric data type.



In [3]:
# Let's first find these inconsistencies by checking the unique values in the rating column
print(app_ratings_df['rating'].unique())


['4.5' '3.9' '4.7' ' 4.0 ' 'five' nan '4.2' '4' '3.5' '4.9' '4,2' '4.0'
 '2.1' '3.8' '3.3' '4.8' '4.6' '3.2' '3.7' '5' '6.0' '4.1' '3.0' '4.4'
 '4.3' 'not available']
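
Scanning the unique values by eye works for a small dataset. For a larger one, a rough sketch like the following could list only the entries that fail numeric conversion. The sample Series `raw` is hypothetical and only mirrors a few of the raw values printed above.

```python
import pandas as pd

# Hypothetical sample mirroring a few of the raw values shown above
raw = pd.Series(['4.5', ' 4.0 ', 'five', None, '4,2', 'not available'])

# Strip surrounding whitespace first so padded numbers are not flagged
stripped = raw.str.strip()

# Attempt the conversion; anything unparseable becomes NaN
converted = pd.to_numeric(stripped, errors='coerce')

# Keep entries that were present in the data but failed to convert
problem_values = stripped[converted.isna() & stripped.notna()]
print(problem_values.tolist())
```

The genuinely missing entry is excluded on purpose, so the list contains only values that need a cleaning rule.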


In [4]:
# The unique values confirm leading and trailing spaces, missing values, an object dtype instead of a numeric one, and invalid entries such as 'five' and 'not available'.
# Let's clean the column by stripping whitespace, converting to numeric, and handling missing and invalid values.
# While we are at it, we can also convert review_date to datetime format.

# Normalize the rating column: lowercase, then strip leading and trailing whitespace
app_ratings_df['rating'] = app_ratings_df['rating'].str.lower().str.strip()
print("'rating' with leading and trailing spaces removed and converted to lowercase:")
print(app_ratings_df['rating'].unique())


'rating' with leading and trailing spaces removed and converted to lowercase:
['4.5' '3.9' '4.7' '4.0' 'five' nan '4.2' '4' '3.5' '4.9' '4,2' '2.1'
 '3.8' '3.3' '4.8' '4.6' '3.2' '3.7' '5' '6.0' '4.1' '3.0' '4.4' '4.3'
 'not available']


In [5]:
# The leading and trailing spaces are gone. Next, replace decimal commas with dots so the values can be parsed as numbers
app_ratings_df['rating'] = app_ratings_df['rating'].str.replace(',', '.', regex=False)
print("'rating' with commas replaced by dots:")
print(app_ratings_df['rating'].unique())

# Convert rating to numeric, coercing unparseable values to NaN. We do not fill them with 0, because a missing rating may simply mean the app has not been rated yet.
app_ratings_df['rating'] = pd.to_numeric(app_ratings_df['rating'], errors='coerce')
print(app_ratings_df.info())
print("'rating' after converting to numeric and setting errors to NaN:")
print(app_ratings_df['rating'].unique())

# Converting review_date to datetime format
app_ratings_df['review_date'] = pd.to_datetime(app_ratings_df['review_date'], format='%Y-%m-%d', errors='coerce')
print("'review_date' after converting to datetime format:")
print(app_ratings_df.info())
print()


'rating' with commas replaced by dots:
['4.5' '3.9' '4.7' '4.0' 'five' nan '4.2' '4' '3.5' '4.9' '2.1' '3.8'
 '3.3' '4.8' '4.6' '3.2' '3.7' '5' '6.0' '4.1' '3.0' '4.4' '4.3'
 'not available']
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   app_id       50 non-null     object 
 1   rating       45 non-null     float64
 2   category     50 non-null     object 
 3   review_date  50 non-null     object 
dtypes: float64(1), object(3)
memory usage: 1.7+ KB
None
'rating' after converting to numeric and setting errors to NaN:
[4.5 3.9 4.7 4.  nan 4.2 3.5 4.9 2.1 3.8 3.3 4.8 4.6 3.2 3.7 5.  6.  4.1
 3.  4.4 4.3]
'review_date' after converting to datetime format:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----     
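
Because `errors='coerce'` silently turns unparseable dates into `NaT`, it can be worth checking whether anything was lost in the conversion. The following is a minimal sketch of that check, using a hypothetical sample Series `dates` rather than the real column.

```python
import pandas as pd

# Hypothetical sample: two valid dates and one malformed string
dates = pd.Series(['2024-07-05', '2025-03-12', 'not a date'])

# Same call pattern as above: strict format, failures become NaT
parsed = pd.to_datetime(dates, format='%Y-%m-%d', errors='coerce')

# Values that were present in the raw column but became NaT
failed = dates[parsed.isna() & dates.notna()]
print(failed.tolist())
```

An empty list here would mean every non-missing date matched the expected format.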

In [6]:
# Cleaned dataframe
cleaned_app_ratings_df = app_ratings_df.copy()
print("Answer 1: Cleaned dataframe with normalized 'rating' column and 'review_date' in datetime format:")
print(cleaned_app_ratings_df)
print()
print(cleaned_app_ratings_df.info())


Answer 1: Cleaned dataframe with normalized 'rating' column and 'review_date' in datetime format:
    app_id  rating          category review_date
0   app001     4.5             Games  2024-07-05
1   app002     3.9      Productivity  2024-07-06
2   app001     4.7             Games  2024-07-10
3   app003     4.0  Health & Fitness  2024-08-15
4   app004     NaN         Education  2024-09-01
5   app005     NaN             Games  2024-10-11
6   app006     4.2         Lifestyle  2024-10-20
7   app007     4.0         Utilities  2024-11-15
8   app008     3.5     Entertainment  2024-12-01
9   app009     4.9  Health & Fitness  2024-12-15
10  app010     4.2             Games  2025-01-07
11  app011     3.5      Productivity  2025-01-15
12  app012     4.0         Education  2025-01-20
13  app013     2.1             Games  2025-02-14
14  app014     3.8         Lifestyle  2025-02-20
15  app015     4.5             Games  2025-03-03
16  app016     3.3         Utilities  2025-03-12
17  app017     4.8  
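
One value that survives the numeric conversion is 6.0, which looks suspicious if ratings are on a 1 to 5 scale. The question does not ask for a range check, so the sketch below is optional and rests on that assumed scale. The names `MIN_RATING` and `MAX_RATING` and the sample Series are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical cleaned ratings, including an out-of-range 6.0
ratings = pd.Series([4.5, np.nan, 6.0, 2.1, 5.0])

# Assumption: valid ratings fall between 1 and 5 inclusive
MIN_RATING, MAX_RATING = 1.0, 5.0

# between() is False for NaN, so require notna() to avoid flagging missing values
out_of_range = ratings.notna() & ~ratings.between(MIN_RATING, MAX_RATING)

# Mask out-of-range values as NaN rather than guessing a correction
validated = ratings.mask(out_of_range)
print(validated.tolist())
```

Masking to NaN keeps the row in the dataset while excluding the questionable value from later statistics.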

Question 2 of 3

Using the cleaned dataset, display the first and last five entries to get an overview of the app ratings across different categories.



In [7]:
# .head() and .tail() return the first and last rows of the DataFrame
print("Showing first 5 entries:")
print(cleaned_app_ratings_df.head(5))
print()
print("Showing last 5 entries:")
print(cleaned_app_ratings_df.tail(5))


Showing first 5 entries:
   app_id  rating          category review_date
0  app001     4.5             Games  2024-07-05
1  app002     3.9      Productivity  2024-07-06
2  app001     4.7             Games  2024-07-10
3  app003     4.0  Health & Fitness  2024-08-15
4  app004     NaN         Education  2024-09-01

Showing last 5 entries:
    app_id  rating          category review_date
45  app006     4.0         Lifestyle  2024-07-15
46  app007     NaN         Utilities  2024-07-16
47  app008     4.6     Entertainment  2024-07-17
48  app009     3.9  Health & Fitness  2024-07-18
49  app010     4.1             Games  2024-07-19


Question 3 of 3

Calculate the basic summary statistics (mean, median, standard deviation) of app ratings for each category to identify variations and performance patterns.

In [8]:
# Group the data by category, then compute descriptive statistics for each group.
# describe() reports the mean and standard deviation directly; the median is the 50% column.
# The include argument is dropped because it has no effect on a single numeric column.
grouped_app_ratings_df = cleaned_app_ratings_df.groupby('category')['rating'].describe()
print("Descriptive statistics:")
print(grouped_app_ratings_df)


Descriptive statistics:
                  count      mean       std  min    25%   50%    75%  max
category                                                                 
Education           5.0  4.380000  0.376829  4.0  4.200  4.30  4.400  5.0
Entertainment       5.0  4.220000  0.511859  3.5  4.000  4.20  4.600  4.8
Games              14.0  4.185714  0.884792  2.1  3.950  4.35  4.575  6.0
Health & Fitness    8.0  4.237500  0.381491  3.8  3.975  4.15  4.450  4.9
Lifestyle           4.0  4.025000  0.170783  3.8  3.950  4.05  4.125  4.2
Productivity        5.0  3.780000  0.192354  3.5  3.700  3.80  3.900  4.0
Utilities           4.0  3.725000  0.718215  3.0  3.225  3.65  4.150  4.6
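
In the table above the median appears as the 50% column. If only the three requested statistics are wanted under explicit names, `agg` is one alternative. This is a minimal sketch on a hypothetical miniature DataFrame, not the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the cleaned data
df = pd.DataFrame({
    'category': ['Games', 'Games', 'Games', 'Education', 'Education'],
    'rating': [4.0, 5.0, np.nan, 4.0, 4.4],
})

# NaN ratings are skipped; std is the sample standard deviation (ddof=1)
summary = df.groupby('category')['rating'].agg(['mean', 'median', 'std', 'count'])
print(summary.round(3))
```

Including `count` alongside the statistics helps flag categories whose numbers rest on only a handful of ratings.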


In [9]:
# Note: pandas and numpy are already imported as pd and np
# The following tables are loaded as pandas DataFrames with the same names: app_ratings
# Please print your final result or dataframe

################################################################################
print()
print("=" * 150)
print("=" * 150)
print()
################################################################################
# Question 1 of 3 
# There are some data inconsistencies in the 'rating' column, specifically: leading or trailing whitespace, decimals represented by commas instead of decimal points (e.g., 4,2 instead of 4.2), and non-numeric values. Clean up these data issues and convert the column to a numeric data type.

# Copy the preloaded app_ratings DataFrame and display it
app_ratings_df = app_ratings.copy()
print(app_ratings_df)
print()
print(app_ratings_df.info())
print()
print("=" * 150)

# Let's first find these inconsistencies by checking the unique values in the rating column
print(app_ratings_df['rating'].unique())
print()
print("=" * 150)

# The unique values confirm leading and trailing spaces, missing values, an object dtype instead of a numeric one, and invalid entries such as 'five' and 'not available'.
# Let's clean the column by stripping whitespace, converting to numeric, and handling missing and invalid values.
# While we are at it, we can also convert review_date to datetime format.

# Normalize the rating column: lowercase, then strip leading and trailing whitespace
app_ratings_df['rating'] = app_ratings_df['rating'].str.lower().str.strip()
print("'rating' with leading and trailing spaces removed and converted to lowercase:")
print(app_ratings_df['rating'].unique())
print()
print("=" * 150)

# The leading and trailing spaces are gone. Next, replace decimal commas with dots so the values can be parsed as numbers
app_ratings_df['rating'] = app_ratings_df['rating'].str.replace(',', '.', regex=False)
print("'rating' with commas replaced by dots:")
print(app_ratings_df['rating'].unique())
print()
print("=" * 150)

# Convert rating to numeric, coercing unparseable values to NaN. We do not fill them with 0, because a missing rating may simply mean the app has not been rated yet.
app_ratings_df['rating'] = pd.to_numeric(app_ratings_df['rating'], errors='coerce')
print(app_ratings_df.info())
print("'rating' after converting to numeric and setting errors to NaN:")
print(app_ratings_df['rating'].unique())
print()
print("=" * 150)

# Converting review_date to datetime format
app_ratings_df['review_date'] = pd.to_datetime(app_ratings_df['review_date'], format='%Y-%m-%d', errors='coerce')
print("'review_date' after converting to datetime format:")
print(app_ratings_df.info())
print()
print()
print("=" * 150)

# Cleaned dataframe
cleaned_app_ratings_df = app_ratings_df.copy()
print("Answer 1: Cleaned dataframe with normalized 'rating' column and 'review_date' in datetime format:")
print(cleaned_app_ratings_df)
print()
print(cleaned_app_ratings_df.info())
print()
print("=" * 150)

################################################################################
print()
print("=" * 150)
print("=" * 150)
print()
################################################################################
# Question 2 of 3 
# Using the cleaned dataset, display the first and last five entries to get an overview of the app ratings across different categories.

# .head() and .tail() return the first and last rows of the DataFrame
print("Showing first 5 entries:")
print(cleaned_app_ratings_df.head(5))
print()
print("Showing last 5 entries:")
print(cleaned_app_ratings_df.tail(5))
print()
print("=" * 150)

################################################################################
print()
print("=" * 150)
print("=" * 150)
print()
################################################################################
# Question 3 of 3
# Calculate the basic summary statistics (mean, median, standard deviation) of app ratings for each category to identify variations and performance patterns.

# Group the data by category, then compute descriptive statistics for each group.
# describe() reports the mean and standard deviation directly; the median is the 50% column.
# The include argument is dropped because it has no effect on a single numeric column.
grouped_app_ratings_df = cleaned_app_ratings_df.groupby('category')['rating'].describe()
print("Descriptive statistics:")
print(grouped_app_ratings_df)



