
#Introduction
 This project focuses on cleaning and preparing the Google Play Store apps dataset for analysis. Clean data is crucial for accurate insights as it ensures the reliability of any conclusions drawn from the analysis. The dataset contains information about Android apps including their ratings, sizes, install counts, and other relevant metrics. This data was collected from Kaggle.

Data Importance
The Google Play Store dataset is valuable because it:

Provides insights into user preferences and market trends

Helps developers understand what makes apps successful

Allows for competitive analysis in the mobile app market



#Data Inspection

In [257]:
#import pandas library and load the dataset
import pandas as pd
df = pd.read_csv("gplaystore.csv")

df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10841 entries, 0 to 10840
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   App             10841 non-null  object 
 1   Category        10841 non-null  object 
 2   Rating          9367 non-null   float64
 3   Reviews         10841 non-null  object 
 4   Size            10841 non-null  object 
 5   Installs        10841 non-null  object 
 6   Type            10840 non-null  object 
 7   Price           10841 non-null  object 
 8   Content Rating  10840 non-null  object 
 9   Genres          10841 non-null  object 
 10  Last Updated    10841 non-null  object 
 11  Current Ver     10833 non-null  object 
 12  Android Ver     10838 non-null  object 
dtypes: float64(1), object(12)
memory usage: 1.1+ MB


At a glance, several things stand out. There are 10841 rows and 13 columns. The Rating column has a chunk of nulls (only 9367 of 10841 values are non-null), and columns that should be numerical, such as Reviews, Size, and Installs, are stored as objects.
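These observations can be quantified rather than read off by eye. The following is a minimal sketch on a small hypothetical frame (`sample` and its values are made up for illustration and are not taken from the dataset). It computes the share of missing values per column and lists the entries that keep a column from being numeric.

```python
import pandas as pd

# Hypothetical miniature of the raw data; values are illustrative only.
sample = pd.DataFrame({
    "Rating": [4.1, None, 3.9, None],
    "Reviews": ["159", "967", "3.0M", "87510"],
})

# Share of missing values per column, as a percentage.
missing_pct = sample.isnull().mean() * 100

# Entries that cannot be parsed as numbers explain an object dtype.
parsed = pd.to_numeric(sample["Reviews"], errors="coerce")
bad_reviews = sample.loc[parsed.isnull(), "Reviews"].tolist()

print(missing_pct)
print(bad_reviews)
```

Running the same two checks on `df` would show which raw strings need attention before any type conversion.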

In [258]:
df.head()

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,7-Jan-18,1.0.0,4.0.3 and up
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,15-Jan-18,2.0.0,4.0.3 and up
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,1-Aug-18,1.2.4,4.0.3 and up
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,8-Jun-18,Varies with device,4.2 and up
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,"100,000+",Free,0,Everyone,Art & Design;Creativity,20-Jun-18,1.1,4.4 and up


#Data cleaning


Some columns turned out to be unnecessary for this analysis. "Last Updated" is stored as a text date (for example, 7-Jan-18) and, like "Current Ver" and "Android Ver", was of little use for the intended purpose. The binary "Type" column (Free/Paid) overlaps with "Price", since free apps have a price of zero, so only one of the two is needed; the code below drops "Price" and keeps "Type". Finally, "Genres" and "Category" appear to provide largely duplicate information, so "Genres" is dropped to reduce redundancy. Removing these columns simplifies the dataset and keeps the focus on the more informative variables.
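The claim that "Type" and "Price" carry the same information can be checked before dropping either column. This is a minimal sketch on hypothetical rows. The `$` prefix on paid prices is an assumption about how the raw column is formatted, since the sample rows shown earlier only contain free apps.

```python
import pandas as pd

# Hypothetical rows; values and the "$" price format are assumptions.
sample = pd.DataFrame({
    "Type": ["Free", "Paid", "Free", "Paid"],
    "Price": ["0", "$4.99", "0", "$0.99"],
})

# Strip the currency symbol so the price can be compared numerically.
price = pd.to_numeric(sample["Price"].str.lstrip("$"), errors="coerce")

# "Type" is redundant with "Price" if Free always pairs with a zero price
# and Paid always pairs with a non-zero one.
type_matches_price = ((sample["Type"] == "Free") == (price == 0)).all()
print(type_matches_price)
```

If the check fails on the real data, the mismatching rows are worth inspecting before one of the two columns is removed.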

In [259]:
df.drop(['Last Updated', 'Current Ver', 'Android Ver', 'Genres','Price'],
        axis=1, inplace=True, errors='ignore')

Handling missing values

In [260]:
#The first step is to deal with null values. There are many ways to do this, such as imputation and removal: removal is usual when a large amount of data is missing, imputation when only a small percentage is missing.
df.isnull().sum()


Unnamed: 0,0
App,0
Category,0
Rating,1474
Reviews,0
Size,0
Installs,0
Type,1
Content Rating,1
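The two options named in the comment above, removal and imputation, can be compared side by side. The following is a small sketch on hypothetical ratings, not the notebook's own method (the notebook uses removal). Median imputation is picked here only as one common choice.

```python
import pandas as pd

# Hypothetical ratings; values are illustrative only.
sample = pd.DataFrame({"Rating": [4.0, None, 3.0, 5.0, None]})

# Option 1: removal, which shrinks the data.
removed = sample.dropna(subset=["Rating"])

# Option 2: imputation with the median, which keeps every row.
median_rating = sample["Rating"].median()
imputed = sample.fillna({"Rating": median_rating})

print(len(removed), len(imputed), median_rating)
```

Removal keeps only observed values, while imputation keeps the row count but pulls the distribution toward its center.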


In [261]:
#the table above shows the number of missing values in each column; drop the rows that contain them
df.dropna(inplace=True)
df.isnull().sum()

Unnamed: 0,0
App,0
Category,0
Rating,0
Reviews,0
Size,0
Installs,0
Type,0
Content Rating,0


Handling duplicated values

In [262]:
df.duplicated().sum() #find the number of duplicates


np.int64(476)

In [263]:
#476 duplicates
df.drop_duplicates(inplace=True)
#verify
df.duplicated().sum()


np.int64(0)
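`duplicated()` with no arguments only flags rows that match an earlier row in every column. As a hedged sketch on hypothetical rows, the example compares that with duplicates by app name alone, which would also catch the same app listed with different values. Whether such rows exist in this dataset is not shown in the notebook.

```python
import pandas as pd

# Hypothetical rows; values are illustrative only.
sample = pd.DataFrame({
    "App": ["A", "A", "A", "B"],
    "Reviews": [100, 100, 120, 5],
})

# Exact duplicates: every column must match an earlier row.
exact_dupes = sample.duplicated().sum()

# Duplicates by app name only: the same app listed with different values.
name_dupes = sample.duplicated(subset=["App"]).sum()

print(exact_dupes, name_dupes)
```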

#Data Type Conversion

Removing '+' from Installs and converting to numeric

Converting Ratings, Reviews, and Installs to numeric

Removing commas from Installs

Standardizing Size measurements to megabytes

In [264]:

#remove the trailing '+' and the commas from Installs
df['Installs'] = df['Installs'].astype(str).str.rstrip('+')
df['Installs'] = df['Installs'].str.replace(',', '', regex=False)


df.head()

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Content Rating
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,10000,Free,Everyone
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,500000,Free,Everyone
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,5000000,Free,Everyone
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,50000000,Free,Teen
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,100000,Free,Everyone


In [265]:

df['Rating'] = pd.to_numeric(df['Rating'], errors='coerce')
df['Reviews'] = pd.to_numeric(df['Reviews'], errors='coerce')
df['Installs'] = pd.to_numeric(df['Installs'], errors='coerce')
df.info()








<class 'pandas.core.frame.DataFrame'>
Index: 8890 entries, 0 to 10840
Data columns (total 8 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   App             8890 non-null   object 
 1   Category        8890 non-null   object 
 2   Rating          8890 non-null   float64
 3   Reviews         8890 non-null   int64  
 4   Size            8890 non-null   object 
 5   Installs        8890 non-null   int64  
 6   Type            8890 non-null   object 
 7   Content Rating  8890 non-null   object 
dtypes: float64(1), int64(2), object(5)
memory usage: 625.1+ KB
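Because `errors='coerce'` turns anything unparseable into NaN without a warning, it can help to list which values were coerced. The following is a minimal sketch on sample strings. The entry `"Free"` is a hypothetical malformed value added for illustration.

```python
import pandas as pd

# Sample values in the Installs format, plus one hypothetical bad entry.
installs = pd.Series(["10,000+", "500,000+", "Free"])

# Drop the trailing '+', remove thousands separators, then convert.
cleaned = pd.to_numeric(
    installs.str.rstrip("+").str.replace(",", "", regex=False),
    errors="coerce",
)

# Values that were present before conversion but are NaN afterwards
# were silently coerced and deserve a closer look.
coerced = installs[cleaned.isnull() & installs.notnull()]
print(coerced.tolist())
```

In the output above, all 8890 values remain non-null after conversion, so nothing was coerced in this run.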


In [266]:
df.head()

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Content Rating
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,10000,Free,Everyone
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,500000,Free,Everyone
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,5000000,Free,Everyone
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,50000000,Free,Teen
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,100000,Free,Everyone


To prepare the 'Size' column for conversion to a numerical data type, it is necessary to handle the non-numeric entries represented by the string "Varies with device".
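One way to handle both the unit letters and the non-numeric entries in a single pass is to split each value into a number and a unit. This is a sketch under stated assumptions. The sample values are illustrative, the column is assumed to use only the suffixes `M` and `k`, and 1 MB is taken as 1024 KB.

```python
import pandas as pd

# Sample values in the formats the Size column uses; "201k" is illustrative.
sizes = pd.Series(["19M", "8.7M", "201k", "Varies with device"])

# Split each entry into a number and a unit letter; entries without
# that shape (such as "Varies with device") yield NaN in both parts.
parts = sizes.str.extract(r"^([\d.]+)([Mk])$")
number = pd.to_numeric(parts[0], errors="coerce")

# Multiplier to megabytes, assuming 1 MB = 1024 KB.
to_megabytes = parts[1].map({"M": 1.0, "k": 1 / 1024})

size_mb = number * to_megabytes
print(size_mb.tolist())
```

Keeping the unit next to the number until the conversion is done avoids guessing the unit from the magnitude of the value.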

In [267]:
#Check and verify to see if the data has changed
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 8890 entries, 0 to 10840
Data columns (total 8 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   App             8890 non-null   object 
 1   Category        8890 non-null   object 
 2   Rating          8890 non-null   float64
 3   Reviews         8890 non-null   int64  
 4   Size            8890 non-null   object 
 5   Installs        8890 non-null   int64  
 6   Type            8890 non-null   object 
 7   Content Rating  8890 non-null   object 
dtypes: float64(1), int64(2), object(5)
memory usage: 625.1+ KB


In [268]:

#Rename the "Size" column to "Size (megabytes)" and strip the unit letters from the data within that column
df.rename(columns={'Size': 'Size (megabytes)'}, inplace=True)
# Convert 'Size (megabytes)' to string before applying string methods
df['Size (megabytes)'] = df['Size (megabytes)'].astype(str)
#A very small number of rows use "k" for kilobytes; remember them so they can be converted to megabytes later
is_kilobytes = df['Size (megabytes)'].str.endswith('k')
df['Size (megabytes)'] = df['Size (megabytes)'].str.replace('M', '', regex=False)
df['Size (megabytes)'] = df['Size (megabytes)'].str.replace('k', '', regex=False)

In [269]:
#change the data type of the Size column; "Varies with device" becomes NaN
df['Size (megabytes)'] = pd.to_numeric(df['Size (megabytes)'], errors='coerce')
#convert every kilobyte row to megabytes (1 MB = 1024 KB), not only values above 1000
df.loc[is_kilobytes, 'Size (megabytes)'] = df.loc[is_kilobytes, 'Size (megabytes)'] / 1024

In [270]:
#Check and verify to see if the data has changed
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 8890 entries, 0 to 10840
Data columns (total 8 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   App               8890 non-null   object 
 1   Category          8890 non-null   object 
 2   Rating            8890 non-null   float64
 3   Reviews           8890 non-null   int64  
 4   Size (megabytes)  7422 non-null   float64
 5   Installs          8890 non-null   int64  
 6   Type              8890 non-null   object 
 7   Content Rating    8890 non-null   object 
dtypes: float64(2), int64(2), object(4)
memory usage: 625.1+ KB


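Size now has 1468 missing values (7422 of 8890 are non-null), because entries that could not be parsed as numbers, such as "Varies with device", were coerced to NaN.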

In [271]:
#drop nulls in size
df.dropna(inplace=True)
df.isnull().sum()

Unnamed: 0,0
App,0
Category,0
Rating,0
Reviews,0
Size (megabytes),0
Installs,0
Type,0
Content Rating,0


In [272]:
df.head()


Unnamed: 0,App,Category,Rating,Reviews,Size (megabytes),Installs,Type,Content Rating
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19.0,10000,Free,Everyone
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14.0,500000,Free,Everyone
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7,5000000,Free,Everyone
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25.0,50000000,Free,Teen
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8,100000,Free,Everyone


In [273]:
#Now the data is ready to be exported for analysis
df.to_csv('gplaystore_cleaned.csv', index=False)
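A quick round-trip check can confirm that the exported file reloads with the expected types. The sketch below uses a hypothetical three-column frame and an in-memory buffer instead of the real file, so the names and values are illustrative only.

```python
import io

import pandas as pd

# Hypothetical cleaned rows; values are illustrative only.
cleaned = pd.DataFrame({
    "App": ["A", "B"],
    "Rating": [4.1, 3.9],
    "Installs": [10000, 500000],
})

# Write to an in-memory buffer instead of a file, then read it back.
buffer = io.StringIO()
cleaned.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# The reloaded frame should keep its columns and numeric dtypes.
print(reloaded.dtypes)
print(int(reloaded.isnull().sum().sum()))
```

The same check applied to `gplaystore_cleaned.csv` would catch a column that silently reverted to text.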


#Conclusion
Overall, cleaning this dataset highlighted the need for attention to detail and a significant time investment. Handling the missing ratings, the variable size entries, and the formatting of the 'Installs' column made it evident that data cleaning is not a step you can skip. Without it, the validity and reliability of the analysis would be compromised.

#What questions could be answered?
What is the distribution of app installs across different categories?

Are free apps generally more popular (have more installs) than paid apps?

Which app categories are most popular (high installs) and least saturated (fewer apps)?

Which app categories attract the largest user base?

Can we predict an app's rating based on its number of reviews, number of installs, and size?

These app-related business questions can be addressed by analyzing relationships and distributions within the data using simple and multiple linear regression techniques and insightful visualizations.
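Before any regression, several of these questions can be explored with grouped summaries. This is a minimal sketch on hypothetical rows that reuse the cleaned column names. The values are made up and say nothing about the real results.

```python
import pandas as pd

# Hypothetical cleaned rows; values are illustrative only.
apps = pd.DataFrame({
    "Category": ["GAME", "GAME", "TOOLS", "TOOLS"],
    "Type": ["Free", "Paid", "Free", "Free"],
    "Installs": [1000000, 10000, 500000, 100000],
})

# Median installs by type: a robust comparison of free and paid apps.
median_by_type = apps.groupby("Type")["Installs"].median()

# Total installs and number of apps per category.
by_category = apps.groupby("Category")["Installs"].agg(["sum", "count"])

print(median_by_type)
print(by_category)
```

The median is used rather than the mean because install counts are heavily skewed by a few very large apps.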
