In [1]:
import pandas as pd
import numpy as np

**Basic Structure**

In [2]:
df = pd.read_csv('dataset/PlayStoreData.csv')
print("Shape: ",df.shape)

Shape:  (10841, 13)


In [3]:
print(df.head())

                                                 App        Category  Rating  \
0     Photo Editor & Candy Camera & Grid & ScrapBook  ART_AND_DESIGN     4.1   
1                                Coloring book moana  ART_AND_DESIGN     3.9   
2  U Launcher Lite – FREE Live Cool Themes, Hide ...  ART_AND_DESIGN     4.7   
3                              Sketch - Draw & Paint  ART_AND_DESIGN     4.5   
4              Pixel Draw - Number Art Coloring Book  ART_AND_DESIGN     4.3   

  Reviews  Size     Installs  Type Price Content Rating  \
0     159   19M      10,000+  Free     0       Everyone   
1     967   14M     500,000+  Free     0       Everyone   
2   87510  8.7M   5,000,000+  Free     0       Everyone   
3  215644   25M  50,000,000+  Free     0           Teen   
4     967  2.8M     100,000+  Free     0       Everyone   

                      Genres      Last Updated         Current Ver  \
0               Art & Design   January 7, 2018               1.0.0   
1  Art & Design;Pretend 

**Missing Values and Data Types**

In [4]:
print('Sum of Missing Values: \n',df.isnull().sum())

Sum of Missing Values: 
 App                  0
Category             0
Rating            1474
Reviews              0
Size                 0
Installs             0
Type                 1
Price                0
Content Rating       1
Genres               0
Last Updated         0
Current Ver          8
Android Ver          3
dtype: int64


In [5]:
print(df.dtypes)

App                object
Category           object
Rating            float64
Reviews            object
Size               object
Installs           object
Type               object
Price              object
Content Rating     object
Genres             object
Last Updated       object
Current Ver        object
Android Ver        object
dtype: object


> Info about data:
- Shape: (10841, 13)
- Columns Reviews, Size, Installs, and Price are stored as object and need conversion to numeric types
- Columns Rating (1474), Type (1), Content Rating (1), Current Ver (8), and Android Ver (3) have missing values
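
Before converting an object column, it can help to see which entries would fail to parse. The following is a minimal sketch on a toy Series, not the real data: the values are illustrative assumptions, and `bad_entries` is a name introduced here.

```python
import pandas as pd

# Toy stand-in for an object-typed column; the values are illustrative only.
reviews = pd.Series(["159", "967", "87510", "3.0M"], name="Reviews")

# Coerce to numbers; anything that cannot be parsed becomes NaN.
as_number = pd.to_numeric(reviews, errors="coerce")

# Entries that were present in the input but could not be parsed.
bad_entries = reviews[as_number.isna() & reviews.notna()]
print(bad_entries.tolist())
```

Running the same check on each object column listed above would show which rows need special handling before a strict `astype` call.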


In [6]:
df['Rating'] = pd.to_numeric(df['Rating'], errors='coerce')

In [7]:
median_rating = df['Rating'].median()

In [8]:
df['Rating'] = df['Rating'].fillna(median_rating)  # assign back instead of chained inplace=True


In [9]:
df = df[df['Rating'].between(0, 5)].copy()  # Ratings must be 0-5; copy so later column assignments act on an independent frame

In [10]:
df['Installs'] = df['Installs'].str.replace(r'[+,]', '', regex=True)

In [11]:
df['Installs'] = pd.to_numeric(df['Installs'], errors='coerce')

In [12]:
def convert_size(size):
    if 'M' in size:
        return float(size.replace('M', ''))           # “19M” → 19.0 MB
    elif 'k' in size:
        return float(size.replace('k', '')) / 1024    # “500k” → 0.48828125 MB
    else:
        return np.nan                                 # “Varies with device” → NaN

In [13]:
df['Size'] = df['Size'].apply(convert_size)
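
`convert_size` assumes every value is a string, so a missing entry (a float NaN) would raise a `TypeError`. A more defensive variant might look like the sketch below. `convert_size_safe` is a hypothetical helper, not part of the original notebook, and the sample values are made up for illustration.

```python
import numpy as np
import pandas as pd

def convert_size_safe(size):
    """Return the size in MB, or NaN when it cannot be parsed (hypothetical helper)."""
    if not isinstance(size, str):
        return np.nan                       # NaN, None, or other non-string input
    text = size.strip()
    try:
        if text.endswith("M"):
            return float(text[:-1])         # already in MB
        if text.endswith("k"):
            return float(text[:-1]) / 1024  # KB to MB
    except ValueError:
        pass                                # numeric part was not a plain number
    return np.nan                           # e.g. "Varies with device"

# Illustrative inputs only.
sample = pd.Series(["19M", "512k", "Varies with device", None])
sizes_mb = sample.apply(convert_size_safe)
print(sizes_mb.tolist())
```

Checking the suffix with `endswith` rather than `in` also avoids matching an `M` or `k` that appears elsewhere in the string.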

In [14]:
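df['Price'] = df['Price'].replace('Free', '0')  # keep values as strings so .str works on the next line
df['Price'] = df['Price'].str.replace('$', '', regex=False).astype(float)  # strip the literal dollar sign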

In [15]:
# Drop rows that are missing critical fields
df.dropna(subset=['App', 'Category'], inplace=True)

# Note: this file is written before Size is imputed and Reviews is converted in the cells below
df.to_csv("processed/PlayStoreData_CLEAN.csv", index=False)
print("New Shape:", df.shape)

New Shape: (10840, 13)


In [16]:
# Calculate MEDIAN size PER CATEGORY (e.g., all GAME apps)
category_medians = df.groupby('Category')['Size'].median()

# Fill missing Size with MEDIAN of its category
df['Size'] = df.apply(
    lambda row: category_medians[row['Category']] 
                if pd.isna(row['Size']) 
                else row['Size'], 
    axis=1
)
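
The row-wise `apply` above can be slow on larger frames. An equivalent vectorized form, sketched here on a toy frame with made-up values, uses `groupby(...).transform('median')` together with `fillna`. The names `toy` and `median_per_row` are introduced only for this illustration.

```python
import numpy as np
import pandas as pd

# Toy data; the categories and sizes are illustrative only.
toy = pd.DataFrame({
    "Category": ["GAME", "GAME", "GAME", "TOOLS", "TOOLS"],
    "Size": [10.0, 30.0, np.nan, 4.0, np.nan],
})

# Median of each row's category, aligned to the original index.
median_per_row = toy.groupby("Category")["Size"].transform("median")

# Only the missing entries are replaced; existing sizes are untouched.
toy["Size"] = toy["Size"].fillna(median_per_row)
print(toy["Size"].tolist())
```

As with the `apply` version, a category whose sizes are all missing would keep NaN, because its median is itself NaN.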

In [17]:
df['Reviews'] = df['Reviews'].str.replace(',', '').astype(int)  # "1,000" → 1000

In [18]:
print('Missing Values: \n',df.isnull().sum())

Missing Values: 
 App               0
Category          0
Rating            0
Reviews           0
Size              0
Installs          0
Type              1
Price             0
Content Rating    0
Genres            0
Last Updated      0
Current Ver       8
Android Ver       2
dtype: int64


In [19]:
print('Top Categories: \n',df['Category'].value_counts().head(3))

Top Categories: 
 Category
FAMILY    1972
GAME      1144
TOOLS      843
Name: count, dtype: int64


In [20]:
print('Data Types: ',df.dtypes)

Data Types:  App                object
Category           object
Rating            float64
Reviews             int64
Size              float64
Installs            int64
Type               object
Price             float64
Content Rating     object
Genres             object
Last Updated       object
Current Ver        object
Android Ver        object
dtype: object


> Info after cleanup:
- Shape: (10840, 13)
- Converted Reviews and Installs to int64, and Size and Price to float64
- Filled missing values:
    - Rating: overall median
    - Size: median of the app's category
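
The summary above could also be checked programmatically. This is a minimal sketch, assuming the five numeric columns named in this notebook. `check_clean` is a hypothetical helper and the toy frame only mirrors the column names, not the real data.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

def check_clean(frame):
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    for col in ["Rating", "Reviews", "Size", "Installs", "Price"]:
        if not is_numeric_dtype(frame[col]):
            problems.append(f"{col} is not numeric")
    if is_numeric_dtype(frame["Rating"]) and not frame["Rating"].between(0, 5).all():
        problems.append("Rating outside 0-5")
    return problems

# Toy frame with the expected dtypes; values are illustrative only.
toy = pd.DataFrame({
    "Rating": [4.1, 3.9],
    "Reviews": [159, 967],
    "Size": [19.0, 14.0],
    "Installs": [10000, 500000],
    "Price": [0.0, 0.0],
})
print(check_clean(toy))
```

Calling `check_clean(df)` after the cleaning cells would give a quick signal if a later change reintroduces an object-typed column.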

In [21]:
df = pd.read_csv("processed/PlayStoreData_CLEAN.csv")
print(df.sort_values(by='Installs', ascending=False).head())

                            App          Category  Rating   Reviews  Size  \
2808              Google Photos       PHOTOGRAPHY     4.5  10858556   NaN   
2853              Google Photos       PHOTOGRAPHY     4.5  10858538   NaN   
3223  Maps - Navigate & Explore  TRAVEL_AND_LOCAL     4.3   9235373   NaN   
3117  Maps - Navigate & Explore  TRAVEL_AND_LOCAL     4.3   9235155   NaN   
3234                     Google             TOOLS     4.4   8033493   NaN   

        Installs  Type  Price Content Rating          Genres    Last Updated  \
2808  1000000000  Free    0.0       Everyone     Photography  August 6, 2018   
2853  1000000000  Free    0.0       Everyone     Photography  August 6, 2018   
3223  1000000000  Free    0.0       Everyone  Travel & Local   July 31, 2018   
3117  1000000000  Free    0.0       Everyone  Travel & Local   July 31, 2018   
3234  1000000000  Free    0.0       Everyone           Tools  August 3, 2018   

             Current Ver         Android Ver  
2808  Var