__Hello future employer!__

This is an example of my knowledge of data cleaning (or scrubbing, if you prefer). Try to see it not only as knowledge of a programming language but also as my thought process when it comes to 'conversing' with data and determining its usefulness to the task at hand.

In [4]:
import pandas as pd

# Load the data
file_path = "/workspaces/codespaces-jupyter/Data_Cleaning_GooglePlay/GooglePlayStore Excel.csv"
df = pd.read_csv(file_path)

# Display the first few rows of the dataframe to inspect the data
df.head()


Unnamed: 0,App Name,App Id,Category,Rating,Rating Count,Installs,Minimum Installs,Maximum Installs,Free,Price,...,Developer Website,Developer Email,Released,Last Updated,Content Rating,Privacy Policy,Ad Supported,In App Purchases,Editors Choice,Scraped Time
0,Gakondo,com.ishakwe.gakondo,Adventure,0.0,0.0,10+,10.0,15,True,0.0,...,https://beniyizibyose.tk/#/,jean21101999@gmail.com,"Feb 26, 2020","Feb 26, 2020",Everyone,https://beniyizibyose.tk/projects/,False,False,False,15/06/2021 20:19
1,Ampere Battery Info,com.webserveis.batteryinfo,Tools,4.4,64.0,"5,000+",5000.0,7662,True,0.0,...,https://webserveis.netlify.app/,webserveis@gmail.com,"May 21, 2020","May 06, 2021",Everyone,https://dev4phones.wordpress.com/licencia-de-uso/,True,False,False,15/06/2021 20:19
2,Vibook,com.doantiepvien.crm,Productivity,0.0,0.0,50+,50.0,58,True,0.0,...,,vnacrewit@gmail.com,"Aug 9, 2019","Aug 19, 2019",Everyone,https://www.vietnamairlines.com/vn/en/terms-an...,False,False,False,15/06/2021 20:19
3,Smart City Trichy Public Service Vehicles 17UC...,cst.stJoseph.ug17ucs548,Communication,5.0,5.0,10+,10.0,19,True,0.0,...,http://www.climatesmarttech.com/,climatesmarttech2@gmail.com,"Sep 10, 2018","Oct 13, 2018",Everyone,,True,False,False,15/06/2021 20:19
4,GROW.me,com.horodyski.grower,Tools,0.0,0.0,100+,100.0,478,True,0.0,...,http://www.horodyski.com.pl,rmilekhorodyski@gmail.com,"Feb 21, 2020","Nov 12, 2018",Everyone,http://www.horodyski.com.pl,False,False,False,15/06/2021 20:19


We've acquired a dataset extracted from the Google Play store. For anyone who has navigated a popular application store, it's evident that beyond the initial curtain of advertisements on the opening screen and the search bar, things can be chaotic. That chaos, one could argue, stems from user interactivity and the freedom (or ability) to publish apps on the store. Therefore, this is a great opportunity to showcase some skills and discussion with Python.


__Data Cleaning__

In [5]:
# Display the column names
column_names = df.columns
print("Column Names:")
print(column_names)



Column Names:
Index(['App Name', 'App Id', 'Category', 'Rating', 'Rating Count', 'Installs',
       'Minimum Installs', 'Maximum Installs', 'Free', 'Price', 'Currency',
       'Size', 'Minimum Android', 'Developer Id', 'Developer Website',
       'Developer Email', 'Released', 'Last Updated', 'Content Rating',
       'Privacy Policy', 'Ad Supported', 'In App Purchases', 'Editors Choice',
       'Scraped Time'],
      dtype='object')


In [6]:
# Let's start by dropping a column that is not important for the analysis
# Drop the 'Scraped Time' column (columns= already targets columns, so axis=1 is redundant)
df = df.drop(columns=['Scraped Time'])

# Display the first few rows of the dataframe to inspect the data
df.head()


Unnamed: 0,App Name,App Id,Category,Rating,Rating Count,Installs,Minimum Installs,Maximum Installs,Free,Price,...,Developer Id,Developer Website,Developer Email,Released,Last Updated,Content Rating,Privacy Policy,Ad Supported,In App Purchases,Editors Choice
0,Gakondo,com.ishakwe.gakondo,Adventure,0.0,0.0,10+,10.0,15,True,0.0,...,Jean Confident Irénée NIYIZIBYOSE,https://beniyizibyose.tk/#/,jean21101999@gmail.com,"Feb 26, 2020","Feb 26, 2020",Everyone,https://beniyizibyose.tk/projects/,False,False,False
1,Ampere Battery Info,com.webserveis.batteryinfo,Tools,4.4,64.0,"5,000+",5000.0,7662,True,0.0,...,Webserveis,https://webserveis.netlify.app/,webserveis@gmail.com,"May 21, 2020","May 06, 2021",Everyone,https://dev4phones.wordpress.com/licencia-de-uso/,True,False,False
2,Vibook,com.doantiepvien.crm,Productivity,0.0,0.0,50+,50.0,58,True,0.0,...,Cabin Crew,,vnacrewit@gmail.com,"Aug 9, 2019","Aug 19, 2019",Everyone,https://www.vietnamairlines.com/vn/en/terms-an...,False,False,False
3,Smart City Trichy Public Service Vehicles 17UC...,cst.stJoseph.ug17ucs548,Communication,5.0,5.0,10+,10.0,19,True,0.0,...,Climate Smart Tech2,http://www.climatesmarttech.com/,climatesmarttech2@gmail.com,"Sep 10, 2018","Oct 13, 2018",Everyone,,True,False,False
4,GROW.me,com.horodyski.grower,Tools,0.0,0.0,100+,100.0,478,True,0.0,...,Rafal Milek-Horodyski,http://www.horodyski.com.pl,rmilekhorodyski@gmail.com,"Feb 21, 2020","Nov 12, 2018",Everyone,http://www.horodyski.com.pl,False,False,False


Now we can start inspecting the data a little more closely, row by row, and tick the boxes for cleaning.

In [7]:
# Check for duplicates of entire rows
duplicates = df[df.duplicated()]

# Display the duplicates, if any
print("Duplicate Rows:")
print(duplicates)

Duplicate Rows:
Empty DataFrame
Columns: [App Name, App Id, Category, Rating, Rating Count, Installs, Minimum Installs, Maximum Installs, Free, Price, Currency, Size, Minimum Android, Developer Id, Developer Website, Developer Email, Released, Last Updated, Content Rating, Privacy Policy, Ad Supported, In App Purchases, Editors Choice]
Index: []

[0 rows x 23 columns]


No exact duplicate rows, but we should investigate the variables (i.e. columns) more closely. Let's check the variable "App Name" for a start.

In [8]:
# Identify duplicate app names
duplicate_apps = df[df.duplicated(subset='App Name', keep=False)]

# Count the number of occurrences for each duplicated 'App Name'
duplicate_counts = duplicate_apps['App Name'].value_counts()

# Display the title
print("Duplicates Highest to Lowest:")

# Display the total number of duplicates
total_duplicates = duplicate_counts.sum()
print(f"Total Number of Duplicates: {total_duplicates}\n")

# Display the 'App Name' and count of duplicates sorted by count in descending order
for app_name, count in duplicate_counts.sort_values(ascending=False).items():
    print(f"App Name: {app_name}, Duplicate Count: {count}")




Duplicates Highest to Lowest:
Total Number of Duplicates: 63178

App Name: Tic Tac Toe, Duplicate Count: 175
App Name: Calculator, Duplicate Count: 130
App Name: Flashlight, Duplicate Count: 117
App Name: Age Calculator, Duplicate Count: 92
App Name: BMI Calculator, Duplicate Count: 78
App Name: Gallery, Duplicate Count: 75
App Name: Solitaire, Duplicate Count: 72
App Name: Sudoku, Duplicate Count: 68
App Name: 2048, Duplicate Count: 67
App Name: Compass, Duplicate Count: 66
App Name: Music Player, Duplicate Count: 60
App Name: Unit Converter, Duplicate Count: 57
App Name: Currency Converter, Duplicate Count: 53
App Name: Notes, Duplicate Count: 51
App Name: #NAME?, Duplicate Count: 49
App Name: Word Search, Duplicate Count: 48
App Name: Shopping List, Duplicate Count: 48
App Name: Animal Sounds, Duplicate Count: 47
App Name: Bubble Shooter, Duplicate Count: 46
App Name: Hangman, Duplicate Count: 46
App Name: Minesweeper, Duplicate Count: 45
App Name: Status Saver, Duplicate Count: 45
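
A note on reading the total above: because the check uses `keep=False`, it counts every row whose name appears more than once, first occurrences included, rather than only the surplus copies. A minimal sketch of the difference, assuming a made-up four-row frame (`toy` and its values are hypothetical, not taken from the dataset):

```python
import pandas as pd

# Hypothetical toy data, not taken from the Google Play dataset
toy = pd.DataFrame({"App Name": ["Calculator", "Calculator", "Calculator", "Notes"]})

# keep=False marks every row whose name occurs more than once
rows_in_duplicate_groups = int(toy.duplicated(subset="App Name", keep=False).sum())

# The default keep='first' leaves the first occurrence unmarked,
# so this counts only the surplus rows
surplus_rows = int(toy.duplicated(subset="App Name").sum())

print(rows_in_duplicate_groups, surplus_rows)
```

Which of the two numbers matters depends on the question: the first describes how many rows are affected, the second how many would disappear if only one row per name were kept.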


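I used this as an example of examining the __wrong__ variable, and to show that we need to dig deeper: names such as "Tic Tac Toe" or "Calculator" are generic enough that different developers are likely to reuse them, so a shared name alone does not make two rows the same app. I believe a good way to continue is to cross-reference this with Developer IDs.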
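
Before running this on the full dataset, here is a small sketch of the idea on toy data. The frame `toy`, the developer names, and the variable names are all hypothetical and only illustrate how adding a second key narrows the result:

```python
import pandas as pd

# Hypothetical toy data: the names and developers are made up
toy = pd.DataFrame({
    "App Name": ["Calculator", "Calculator", "Calculator", "Notes"],
    "Developer Id": ["Dev A", "Dev B", "Dev B", "Dev C"],
})

# Name only: all three 'Calculator' rows are flagged
by_name = toy[toy.duplicated(subset="App Name", keep=False)]

# Name plus developer: only the two rows published by 'Dev B' remain
by_name_and_dev = toy[toy.duplicated(subset=["App Name", "Developer Id"], keep=False)]

print(len(by_name), len(by_name_and_dev))
```

The second filter is stricter because a row is flagged only when both values match another row.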

In [9]:
# Identify duplicate app names and developer ids
duplicate_apps_devs = df[df.duplicated(subset=['App Name', 'Developer Id'], keep=False)]

# Count the number of occurrences for each duplicated 'App Name' and 'Developer Id'
duplicate_counts_app_dev = duplicate_apps_devs.groupby(['App Name', 'Developer Id']).size().reset_index(name='Duplicate Count')

# Display the title
print("Duplicates based on App Name and Developer Id:")

# Display the total number of duplicates
total_duplicates_app_dev = duplicate_counts_app_dev['Duplicate Count'].sum()
print(f"Total Number of Duplicates based on App Name and Developer Id: {total_duplicates_app_dev}\n")

# Display the 'App Name', 'Developer Id', and count of duplicates sorted by count in descending order
for index, row in duplicate_counts_app_dev.sort_values(by='Duplicate Count', ascending=False).iterrows():
    app_name = row['App Name']
    dev_id = row['Developer Id']
    count = row['Duplicate Count']
    print(f"App Name: {app_name}, Developer Id: {dev_id}, Duplicate Count: {count}")


Duplicates based on App Name and Developer Id:
Total Number of Duplicates based on App Name and Developer Id: 1825

App Name: Funny Memes Stickers for WhatsApp - WAStickerApps, Developer Id: Portugues CFS, Duplicate Count: 16
App Name: #NAME?, Developer Id: BVanStudio, Duplicate Count: 12
App Name: Guru Granth Sahib Ji (Audio), Developer Id: Nirbhau Apps, Duplicate Count: 12
App Name: Kalam(Poetry) Hazrat Maulvi Ghulam Rasool Alampuri, Developer Id: Alampuri Trust, Duplicate Count: 6
App Name: Nature Wallpapers, Developer Id: Legend APPS, Duplicate Count: 6
App Name: Football Clubs Logo Quiz, Developer Id: CogELor, Duplicate Count: 5
App Name: VZ | Exprésate Lector Unidad 3, Developer Id: Grupo Editorial Educar, Duplicate Count: 5
App Name: Happy New Year Photo Frame 2021, Developer Id: Photo frame intira, Duplicate Count: 4
App Name: 3C Legacy Icons - CPU Load, Developer Id: KevLeGik, Duplicate Count: 4
App Name: Basketball Wallpapers HD | 4K, Developer Id: dailylittle, Duplicate Coun
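
The same kind of listing can also be produced without looping over rows. This is a rough sketch on made-up data (the frame `toy` and the names `pair_counts` and `repeated_pairs` are hypothetical), not a replacement for the cell above:

```python
import pandas as pd

# Hypothetical toy data, not taken from the Google Play dataset
toy = pd.DataFrame({
    "App Name": ["Nature Wallpapers", "Nature Wallpapers", "Nature Wallpapers",
                 "Quiz", "Quiz", "Notes"],
    "Developer Id": ["Legend", "Legend", "Legend", "Cog", "Cog", "Dev C"],
})

# Count how many rows share each (App Name, Developer Id) pair
pair_counts = (
    toy.groupby(["App Name", "Developer Id"])
       .size()
       .reset_index(name="Duplicate Count")
)

# Keep only pairs that occur more than once, largest groups first
repeated_pairs = (
    pair_counts[pair_counts["Duplicate Count"] > 1]
    .sort_values("Duplicate Count", ascending=False)
    .reset_index(drop=True)
)

# Print the whole table in one call instead of one line per row
print(repeated_pairs.to_string(index=False))
```

Grouping first and filtering on the count avoids the separate `duplicated` pass, and printing the frame directly keeps the columns aligned.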

Whether or not we trim this data depends on the analysis you want to do. There are many things you could do to this data along what you think is the best path to analysis. Yes, we could "trim the fat" off the dataframe you plan to use, but you could also argue that its presence can still contribute to an accurate analysis. Let's see if we can  