# Google Play Store Apps Data Analysis

Note: This project was completed as a self-learning exercise.
I received guidance on how to use Python libraries for data loading and visualization,
but all data analysis and interpretation are my own work.

Libraries Used
1. NumPy (Numerical Python)
2. Pandas
3. Matplotlib

In [1]:
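# Import the necessary libraries

# Data handling
import numpy as np
import pandas as pd

# Visualization
import matplotlib.pyplot as plt

# Load the dataset and preview the first five rows.
# data.head() must be the last expression in the cell for its output to be displayed.
data = pd.read_csv("googleplaystore.csv")
data.head()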

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,"100,000+",Free,0,Everyone,Art & Design;Creativity,"June 20, 2018",1.1,4.4 and up


## Contents
1. Data Overview
2. Data Cleaning (Pandas)
3. Descriptive Analysis and Visualization (Matplotlib)

### Step 1: Data Overview

In [2]:
data.shape

(10841, 13)

**Interpretation of the Data Structure:**
The dataset contains 10,841 observations and 13 variables.
Each row corresponds to one recorded Android application, and each column holds one attribute of that app, such as its category, rating, number of reviews, or size.
This structure supports both descriptive and comparative statistical analysis, and this project includes both.

In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10841 entries, 0 to 10840
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   App             10841 non-null  object 
 1   Category        10841 non-null  object 
 2   Rating          9367 non-null   float64
 3   Reviews         10841 non-null  object 
 4   Size            10841 non-null  object 
 5   Installs        10841 non-null  object 
 6   Type            10840 non-null  object 
 7   Price           10841 non-null  object 
 8   Content Rating  10840 non-null  object 
 9   Genres          10841 non-null  object 
 10  Last Updated    10841 non-null  object 
 11  Current Ver     10833 non-null  object 
 12  Android Ver     10838 non-null  object 
dtypes: float64(1), object(12)
memory usage: 1.1+ MB


In [4]:
data.describe()

Unnamed: 0,Rating
count,9367.0
mean,4.193338
std,0.537431
min,1.0
25%,4.0
50%,4.3
75%,4.5
max,19.0


**Interpretation of the Data Types:**
Among these variables, only *Rating* is stored as a numeric `float64`; the other 12 are stored as the `object` data type (strings).
Numeric-looking columns such as *Reviews*, *Installs*, and *Size* therefore require conversion during the data cleaning stage.
Additionally, the columns *Rating*, *Type*, *Content Rating*, *Current Ver*, and *Android Ver* contain missing values, as shown by non-null counts lower than the total number of entries (10,841).
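
The non-null counts reported by `data.info()` can also be turned into explicit missing-value counts per column. The following is a minimal sketch of that idea; `sample` is a small hand-made frame with made-up values that stands in for the full dataset.

```python
import numpy as np
import pandas as pd

# Tiny illustrative frame; the values are made up, not taken from the CSV.
sample = pd.DataFrame({
    "App": ["A", "B", "C", "D"],
    "Rating": [4.1, np.nan, 3.9, np.nan],
    "Type": ["Free", "Paid", None, "Free"],
})

# Count missing values per column and keep only columns with at least one.
missing = sample.isna().sum()
missing = missing[missing > 0].sort_values(ascending=False)
print(missing)
```

Applied to `data`, the same two lines should reproduce the gaps implied by the non-null counts, for example 10,841 - 9,367 = 1,474 missing ratings.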

### Step 2: Data Cleaning
Goal: Ensure that the key numeric columns (*Reviews*, *Installs*, *Size*) contain valid values and convert them to numeric types.

### Step 2.1: Validating, Cleaning, and Converting the *Reviews* column
This step ensures that the `Reviews` column contains only valid numeric values that can be used for quantitative analysis.

In [5]:
# 1. Preview for data types and first five rows
data["Reviews"].head()

0       159
1       967
2     87510
3    215644
4       967
Name: Reviews, dtype: object

Purpose: To preview the column and verify its data type (`object`), which means the values are stored as text and need conversion.

In [6]:
# 2. Detect invalid entries (non-numeric)
mask_reviews = data["Reviews"].str.contains("[^0-9]", regex=True, na=False)
data[mask_reviews].head()

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
10472,Life Made WI-Fi Touchscreen Photo Frame,1.9,19.0,3.0M,"1,000+",Free,0,Everyone,,"February 11, 2018",1.0.19,4.0 and up,


We use a regular expression to find entries that contain any non-numeric character, which may cause conversion errors (e.g., "3.0M").
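
To make the behaviour of the pattern `[^0-9]` concrete, here is a small sketch using Python's built-in `re` module rather than pandas. The helper name `has_non_digit` and the example strings are assumptions chosen for illustration.

```python
import re

# Matches any single character that is not one of the digits 0-9.
NON_DIGIT = re.compile(r"[^0-9]")

def has_non_digit(text: str) -> bool:
    """Return True if the string contains at least one non-digit character."""
    return NON_DIGIT.search(text) is not None

examples = ["159", "3.0M", "1,000+", "4.5", ""]
flags = {text: has_non_digit(text) for text in examples}
print(flags)
```

Note that the pattern also flags decimal strings such as "4.5", which is acceptable here because review counts are expected to be whole numbers, and that it does not flag an empty string.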

In [7]:
# 3. Count how many will become NaN if converted
invalid_reviews = int(pd.to_numeric(data["Reviews"], errors="coerce").isna().sum())
invalid_reviews

1

Instead of previewing, here we quantify the issue by simulating conversion.
`pd.to_numeric(..., errors="coerce")` temporarily converts all values to numbers, replacing invalid ones with NaN.
By summing `.isna()`, we can count how many entries would become NaN if converted.
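
Besides counting, it can help to list the raw values that fail to parse, so they can be inspected before they are overwritten. This is a minimal sketch on a made-up Series (`reviews`), not the notebook's own data.

```python
import pandas as pd

# Illustrative values; None stands for a value that was already missing.
reviews = pd.Series(["159", "967", "3.0M", None, "87510"])

converted = pd.to_numeric(reviews, errors="coerce")

# Keep values that were present as text but could not be parsed,
# excluding values that were missing before the conversion.
failed = reviews[converted.isna() & reviews.notna()]
print(failed.tolist())
```

Separating "failed to parse" from "already missing" avoids overcounting invalid entries when a column has genuine gaps.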

In [8]:
# 4. Convert column to numeric type
data["Reviews"] = pd.to_numeric(data["Reviews"], errors="coerce")

The command `pd.to_numeric(data["Reviews"], errors="coerce")` was applied to convert all valid entries into numeric data.  
The single invalid value was automatically converted to `NaN`.  
This ensures that the *Reviews* column can now be used for quantitative analysis.

In [9]:
# 5. Verify conversion
# Only the last expression of a cell is displayed, so print the dtype explicitly.
print(data["Reviews"].dtype)
data["Reviews"].isna().sum()

float64
np.int64(1)

The check returns a single `NaN`, which matches the one invalid entry ("3.0M") identified during validation.  
The *Reviews* column is now stored as a numeric data type and is suitable for quantitative analysis.

### Step 2.2: Validating, Cleaning, and Converting the *Installs* column

In [10]:
# 1. Preview
data["Installs"].head()

0        10,000+
1       500,000+
2     5,000,000+
3    50,000,000+
4       100,000+
Name: Installs, dtype: object

In [11]:
# 2. Remove symbols
data["Installs"] = data["Installs"].str.replace("+", "", regex=False)
data["Installs"] = data["Installs"].str.replace(",", "", regex=False)

In [12]:
# 3. Detect and count invalid values
invalid_installs = int(pd.to_numeric(data["Installs"], errors="coerce").isna().sum())
invalid_installs

1

In [13]:
# 4. Convert column to numeric
data["Installs"] = pd.to_numeric(data["Installs"], errors="coerce")

In [14]:
# 5. Verify conversion
# Only the last expression of a cell is displayed, so print the dtype explicitly.
print(data["Installs"].dtype)
data["Installs"].isna().sum()

float64
np.int64(1)
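
The cleaning steps for *Installs* can be summarised as a single per-value rule. The sketch below uses plain Python; the helper name `parse_installs` and the example strings are assumptions for illustration. In the row printed in Step 2.1 the fields appear shifted by one column, so the one invalid *Installs* value is probably text such as "Free".

```python
import math

def parse_installs(text):
    """Turn strings like "10,000+" into a float; return NaN if not numeric."""
    cleaned = text.replace("+", "").replace(",", "")
    return float(cleaned) if cleaned.isdigit() else math.nan

values = ["10,000+", "5,000,000+", "0", "Free"]
parsed = [parse_installs(v) for v in values]
print(parsed)
```

The vectorised pandas calls used in the cells above do the same work on the whole column at once and are the better choice for the full dataset.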

### Step 2.3: Validating, Cleaning, and Converting the *Size* column

In [15]:
# 1. Preview typical entries and confirm the data type (object).
# Only the last expression in the cell is displayed, so the output below is the random sample.
data["Size"].head(10)

# The unique values and a random sample of 20 show that entries mix unit suffixes
# ("M" and "k") with numbers (integers and floats),
# and that some entries contain "Varies with device" instead of numeric data.
data["Size"].unique()[:20]
data["Size"].sample(20)

8845                   2.6M
4084     Varies with device
3456     Varies with device
342      Varies with device
698                    9.3M
10751                   44k
9197                   3.9M
6137                   4.1M
6108                   2.5M
6492                   8.0M
2484                   7.1M
7134                   5.6M
5615                    46M
3330     Varies with device
6632                   1.2M
10041                  170k
3788     Varies with device
3572     Varies with device
1002                    77M
3472                    15M
Name: Size, dtype: object

In [16]:
int((data["Size"].str.lower() == "varies with device").sum())

# Count how many entries contain the phrase "Varies with device" (case-insensitive). 
# These will be replaced with NaN in the next step.

1695

There are 1,695 entries labeled "Varies with device", which cannot be converted to numeric values.

In [17]:
# 2. Replace the non-numeric text
data["Size"] = data["Size"].replace("Varies with device", np.nan)

# Replace text-based entries ("Varies with device") with NaN
# so that the column only contains numeric-like values or missing values.

In [18]:
# 3. Convert size units
def size_to_mb(size):
    if pd.isna(size):
        return np.nan
    elif "M" in size:
        return float(size.replace("M",""))
    elif size.replace('.', '', 1).isdigit():
        return float(size)
    else: 
        return np.nan

# Since most entries use megabytes ("M"),
# all values are standardized to MB to maintain unit consistency.
# Convert size strings into numeric (MB) values:
# - If the entry contains "M", remove the letter and convert to float.
# - If the entry is already a numeric string (e.g., "5.4"), convert it to float.
# - Otherwise, return NaN. This includes kilobyte entries such as "44k",
#   which this function does not convert.

In [19]:
# 4. Apply size_to_mb to every entry in the Size column
data["Size"] = data["Size"].apply(size_to_mb)

Apply the conversion function to all entries.
Then verify that the column is now numeric and count total NaN values (after replacement).

In [20]:
# 5. Verify conversion
# Only the last expression of a cell is displayed, so print the dtype explicitly.
print(data["Size"].dtype)
int(data["Size"].isna().sum())

float64
2012

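**Verification Note:**  
Before cleaning, 1,695 entries in the *Size* column were labeled `"Varies with device"`.  
After conversion, the total number of missing (NaN) values is 2,012.  
Because `data.info()` reported no missing values in *Size* originally, the remaining 317 NaN values (2,012 - 1,695) come from entries that `size_to_mb` could not parse, such as kilobyte sizes like `"44k"` and `"170k"`.  
This confirms that the text replacement was applied, but it also means those sizes are excluded from the analysis rather than converted to MB.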
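
The sampled output in this step includes kilobyte entries such as "44k" and "170k", which `size_to_mb` as written maps to NaN. Below is a minimal sketch of a variant that converts them instead. The name `size_to_mb_with_kb` is hypothetical, and the factor of 1024 kB per MB is an assumption (1000 would be equally defensible).

```python
import math

def size_to_mb_with_kb(size):
    """Convert "19M" or "44k" to megabytes; anything else becomes NaN."""
    if not isinstance(size, str):
        return math.nan  # covers NaN and other non-string inputs
    text = size.strip()
    try:
        if text.endswith("M"):
            return float(text[:-1])
        if text.endswith("k"):
            return float(text[:-1]) / 1024  # assumption: 1 MB = 1024 kB
        return float(text)
    except ValueError:
        # Non-numeric text such as "Varies with device" or "1,000+".
        return math.nan

examples = ["19M", "8.7M", "44k", "Varies with device", "1,000+"]
converted = [size_to_mb_with_kb(s) for s in examples]
print(converted)
```

Using a variant like this would change the NaN count and the summary statistics reported in this notebook, so those figures would need to be recomputed.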

### Step 3: Descriptive Analysis
After cleaning, key numeric columns such as *Rating*, *Reviews*, *Installs*, and *Size* can now be used for descriptive and comparative statistical analysis.

### Step 3.1: Central trends

In [21]:
data[["Rating", "Reviews", "Installs", "Size"]].describe()

Unnamed: 0,Rating,Reviews,Installs,Size
count,9367.0,10840.0,10840.0,8829.0
mean,4.193338,444152.9,15464340.0,22.27054
std,0.537431,2927761.0,85029360.0,22.628691
min,1.0,0.0,0.0,1.0
25%,4.0,38.0,1000.0,5.4
50%,4.3,2094.0,100000.0,14.0
75%,4.5,54775.5,5000000.0,31.0
max,19.0,78158310.0,1000000000.0,100.0
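
The maximum *Rating* of 19.0 in the summary above falls outside the 1 to 5 star scale, so it is worth locating such rows before interpreting the statistics. This is a minimal sketch on a made-up frame (`ratings`); the column names mirror the dataset but the values are illustrative.

```python
import numpy as np
import pandas as pd

ratings = pd.DataFrame({
    "App": ["A", "B", "C", "D"],
    "Rating": [4.1, 19.0, np.nan, 3.9],
})

# Rows where a rating is present but lies outside the expected 1-5 range.
# between() returns False for NaN, so missing ratings are excluded explicitly.
in_range = ratings["Rating"].between(1, 5)
out_of_range = ratings[ratings["Rating"].notna() & ~in_range]
print(out_of_range)
```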


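### Interpretation
The mean *Rating* is about 4.19 and the median is 4.3, so most rated apps score well, but the maximum of 19.0 lies outside the 1 to 5 star scale and indicates a malformed record.
For *Reviews* and *Installs*, the mean is far larger than the median (about 444,153 versus 2,094 reviews, and about 15.5 million versus 100,000 installs), which indicates strongly right-skewed distributions.
The mean *Size* is about 22.3 MB and the median is 14 MB, based on the 8,829 entries with a parsed size.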

### Step 3.2: Visualization
To better understand distributions and outliers, I will visualize key numeric columns.
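
As a starting point, a histogram on a log scale is one way to view a heavily skewed column such as *Reviews*. The sketch below is an assumption about how such a plot could look, not the notebook's actual figure; the values in `reviews` are illustrative.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import numpy as np

# Illustrative review counts spanning several orders of magnitude.
reviews = np.array([0, 38, 159, 967, 2094, 54775, 87510, 215644])

fig, ax = plt.subplots(figsize=(6, 4))
# log10(1 + x) keeps zero counts while compressing the long right tail.
counts, edges, _ = ax.hist(np.log10(1 + reviews), bins=6)
ax.set_xlabel("log10(1 + Reviews)")
ax.set_ylabel("Number of apps")
ax.set_title("Distribution of review counts (illustrative data)")
fig.tight_layout()
```

With the real data, the array would be replaced by `data["Reviews"].dropna()`, and the same transform could be applied to *Installs*.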