## DATA ANALYTICS FUNDAMENTALS
This workbook demonstrates basic data analytics techniques in Python.

In [1]:
import pandas as pd
import numpy as np

In [2]:
import requests
import matplotlib.pyplot as plt



Load the **GOOGLE PLAYSTORE INFORMATION** dataset into a dataframe.

In [3]:
url="https://raw.githubusercontent.com/AshishJangra27/Data-Analysis-with-Python-GFG/refs/heads/main/2.%20Dataset%20Walkthrough/googleplaystore.csv"
df=pd.read_csv(url)

In [4]:
type(df)

pandas.core.frame.DataFrame

Let us look at the first few entries of the dataframe to see what it looks like.

In [5]:
df.head()

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,"100,000+",Free,0,Everyone,Art & Design;Creativity,"June 20, 2018",1.1,4.4 and up


In [6]:
df.columns

Index(['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type',
       'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver',
       'Android Ver'],
      dtype='object')

#### CHECK FOR NULL VALUES IN THE DATASET
Now we check the dataset for null/NaN values and, if any are present, replace them with suitable values.

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10841 entries, 0 to 10840
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   App             10841 non-null  object 
 1   Category        10841 non-null  object 
 2   Rating          9367 non-null   float64
 3   Reviews         10841 non-null  object 
 4   Size            10841 non-null  object 
 5   Installs        10841 non-null  object 
 6   Type            10840 non-null  object 
 7   Price           10841 non-null  object 
 8   Content Rating  10840 non-null  object 
 9   Genres          10841 non-null  object 
 10  Last Updated    10841 non-null  object 
 11  Current Ver     10833 non-null  object 
 12  Android Ver     10838 non-null  object 
dtypes: float64(1), object(12)
memory usage: 1.1+ MB


In [8]:
df.isnull().sum()

App                  0
Category             0
Rating            1474
Reviews              0
Size                 0
Installs             0
Type                 1
Price                0
Content Rating       1
Genres               0
Last Updated         0
Current Ver          8
Android Ver          3
dtype: int64

As can be observed, several columns contain null values, most notably **Rating** with 1474 of them.
There are two common techniques for dealing with null values: removing the affected entries, or replacing the null values with suitable ones.
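
The two techniques can be compared side by side on a small frame. This is a minimal sketch on hypothetical toy data (the `toy` frame and its values are made up for illustration, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the Play Store data
toy = pd.DataFrame({
    "App": ["A", "B", "C", "D"],
    "Rating": [4.0, np.nan, 3.0, 5.0],
    "Type": ["Free", "Paid", None, "Free"],
})

# Technique 1: drop every row that contains at least one null
dropped = toy.dropna()

# Technique 2: keep every row and fill each column with a chosen value
filled = toy.fillna({"Rating": toy["Rating"].mean(), "Type": "Free"})

print(len(toy), len(dropped), len(filled))
```

Dropping loses the two incomplete rows, while filling keeps all four at the cost of introducing estimated values.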

#### OMITTING NULL ENTRIES

In [9]:
df_nullvaluesdropped=df.dropna()
df_nullvaluesdropped.reset_index(inplace=True)

In [10]:
df_nullvaluesdropped.head()

Unnamed: 0,index,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
1,1,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up
2,2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up
3,3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up
4,4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,"100,000+",Free,0,Everyone,Art & Design;Creativity,"June 20, 2018",1.1,4.4 and up


In [11]:
df_nullvaluesdropped.isnull().sum()

index             0
App               0
Category          0
Rating            0
Reviews           0
Size              0
Installs          0
Type              0
Price             0
Content Rating    0
Genres            0
Last Updated      0
Current Ver       0
Android Ver       0
dtype: int64
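
The extra `index` column in the output above comes from `reset_index`, which by default keeps the old row labels as a new column. If those labels are not needed, passing `drop=True` discards them. A minimal sketch on hypothetical toy data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing rating
toy = pd.DataFrame({"App": ["A", "B", "C"], "Rating": [4.0, np.nan, 3.5]})

# drop=True discards the old row labels instead of saving them as a column
clean = toy.dropna().reset_index(drop=True)

print(list(clean.columns))
print(list(clean.index))
```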

#### REPLACE NULL ENTRIES WITH SUITABLE VALUES
Here we replace null values with suitable entries. Say we fill the **RATING** column with its avg value & the rest with the ones that has the maximum occurances.

##### REPLACE RATING WITH AVG VALUE

In [12]:
def compute_sum(num1,num2):
    # Add two numbers, treating NaN as 0 so a missing rating does not turn the running total into NaN
    if pd.isna(num1):
        num1=0
    if pd.isna(num2):
        num2=0
    return num1+num2

In [13]:
from functools import reduce

In [14]:
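# len() also counts the rows with a missing rating, so every NaN lowers the average as if it were a rating of 0
avg_rating=reduce(compute_sum,list(df.Rating))/len(list(df.Rating))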
print(f"The average rating is {avg_rating}")

The average rating is 3.6231897426436723
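
The value 3.62 is obtained by counting each of the 1474 missing ratings as 0 while still dividing by all 10841 rows. Dividing the same sum by the 9367 non-null ratings would give roughly 4.19. The sketch below shows the difference on hypothetical toy ratings, using `Series.mean()`, which skips NaN by default:

```python
import numpy as np
import pandas as pd

ratings = pd.Series([4.0, np.nan, 5.0, np.nan])  # hypothetical toy ratings

# Counting NaN as 0 and dividing by the full length drags the average down
mean_nan_as_zero = ratings.fillna(0).sum() / len(ratings)

# Series.mean() skips NaN by default, so it divides by the non-null count
mean_skipna = ratings.mean()

print(mean_nan_as_zero, mean_skipna)
```

Which of the two is the better fill value depends on the analysis, but the second is the one usually meant by "average rating".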


##### REPLACE NULL TYPE, CONTENT RATING, ANDROID VERSION & CURRENT VERSION WITH THE MOST COMMON VALUE OF EACH

In [15]:
d=dict(df.Type.value_counts())

In [16]:
def Find_common_occurance(d):
    # Return the key of d that has the highest count
    max_occurances=-1
    max_type=None

    for key in d:
        if d[key]>max_occurances:
            max_occurances=d[key]
            max_type=key
    return max_type

In [17]:
max_type = Find_common_occurance(d)
print(f"The most common type of app on google playstore is {max_type}")

The most common type of app on google playstore is Free


In [18]:
content_rating_counts=dict(df['Content Rating'].value_counts())
max_contentrating = Find_common_occurance(content_rating_counts)

print(f"The most common content rating for android apps on playstore is: {max_contentrating}")

The most common content rating for android apps on playstore is: Everyone


In [19]:
andriod_ver_count=dict(df['Android Ver'].value_counts())
max_andriodver = Find_common_occurance(andriod_ver_count)

print(f"The most prevalent android version on playstore is: {max_andriodver}")

The most prevalent android version on playstore is: 4.1 and up


In [20]:
curr_ver_count=dict(df['Current Ver'].value_counts())
max_currver = Find_common_occurance(curr_ver_count)

print(f"The most common current version of app found on playstore is: {max_currver}")

The most common current version of app found on playstore is: Varies with device
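
pandas can also return the most common value directly, without building a dictionary first. A minimal sketch on a hypothetical toy column, shown as an alternative to the helper above rather than a replacement for it:

```python
import pandas as pd

types = pd.Series(["Free", "Paid", "Free", None, "Free"])  # hypothetical toy column

# value_counts() ignores nulls; idxmax() returns the label with the highest count
most_common = types.value_counts().idxmax()

# Series.mode() returns every value tied for the highest count; take the first
most_common_via_mode = types.mode().iloc[0]

print(most_common, most_common_via_mode)
```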


#### FILL NULL ENTRIES WITH THE COMPUTED VALUES

In [21]:
values_to_replace={"Type":max_type,"Content Rating":max_contentrating,"Current Ver":max_currver,"Android Ver":max_andriodver,"Rating":avg_rating}
df_nullvaluesreplcaed = df.fillna(value=values_to_replace)

In [22]:
df_nullvaluesreplcaed.head()

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,"100,000+",Free,0,Everyone,Art & Design;Creativity,"June 20, 2018",1.1,4.4 and up


In [23]:
df_nullvaluesreplcaed.isnull().sum()

App               0
Category          0
Rating            0
Reviews           0
Size              0
Installs          0
Type              0
Price             0
Content Rating    0
Genres            0
Last Updated      0
Current Ver       0
Android Ver       0
dtype: int64

### IMPUTERS
Here we demonstrate the **IMPUTERS** from the sklearn package, which replace missing values in each column using the **mean, median, mode** or other strategies.

In [24]:
from sklearn.impute import SimpleImputer

### MEAN IMPUTER
Here we use a mean imputer to compute the average of a numerical column (here **Rating**, at column position 2) and replace its NA values with that mean.

In [25]:
df_nullsimputed = df.copy(deep=True)

In [26]:
mean_imputer = SimpleImputer(missing_values=np.nan,strategy="mean")

In [27]:
mean_imputer.fit(df_nullsimputed.iloc[:,2:3].values)

In [28]:
x = mean_imputer.transform(df_nullsimputed.iloc[:,2:3].values)

In [29]:
df_nullsimputed.iloc[:,2:3] = x

In [30]:
df_nullsimputed['Rating'].isnull().values.any()

np.False_

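### MOST FREQUENT IMPUTER
Here we use a most-frequent imputer to replace NA values in the columns from **Type** onwards (column position 6 to the end) with the most common value of each column.

In [31]: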
mostfreq_imputer=SimpleImputer(missing_values=np.nan,strategy="most_frequent")

In [32]:
mostfreq_imputer.fit(df_nullsimputed.iloc[:,6:].values)

In [33]:
x = mostfreq_imputer.transform(df_nullsimputed.iloc[:,6:].values)

In [34]:
df_nullsimputed.iloc[:,6:] = x

In [35]:
df_nullsimputed.isnull().values.any()

np.False_
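
Selecting columns by position with `iloc` breaks silently if the column order changes. The sketch below does the same two imputations by column name and uses `fit_transform` to fit and apply in one call. The toy frame and the `num_cols` / `cat_cols` lists are hypothetical names chosen for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frame with one null per column
toy = pd.DataFrame({
    "Rating": [4.0, np.nan, 5.0],
    "Type": ["Free", np.nan, "Free"],
    "Content Rating": ["Everyone", "Everyone", np.nan],
})

num_cols = ["Rating"]
cat_cols = ["Type", "Content Rating"]

# Numerical column: fill nulls with the column mean
mean_imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
toy[num_cols] = mean_imputer.fit_transform(toy[num_cols].to_numpy())

# Categorical columns: fill nulls with the most frequent value of each column
mostfreq_imputer = SimpleImputer(missing_values=np.nan, strategy="most_frequent")
toy[cat_cols] = mostfreq_imputer.fit_transform(toy[cat_cols].to_numpy(dtype=object))

print(toy)
```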

#### FINDING OUT NUMBER OF FREE AND PAID APPS IN EVERY CATEGORY

In [36]:
categories = list(df_nullsimputed['Category'].unique())
types = list(df_nullsimputed['Type'].unique())

In [37]:
def Find_num_types_in_every_category():
    # Build a nested dict of the form {category: {app type: count}}
    d={}

    for category,app_type in zip(df_nullsimputed['Category'],df_nullsimputed['Type']):
        counts=d.setdefault(category,{})
        counts[app_type]=counts.get(app_type,0)+1
    return d

In [38]:
Categorywise_apptypes = Find_num_types_in_every_category()

In [39]:
Categorywise_apptypes

{'ART_AND_DESIGN': {'Free': 62, 'Paid': 3},
 'AUTO_AND_VEHICLES': {'Free': 82, 'Paid': 3},
 'BEAUTY': {'Free': 53},
 'BOOKS_AND_REFERENCE': {'Free': 203, 'Paid': 28},
 'BUSINESS': {'Free': 446, 'Paid': 14},
 'COMICS': {'Free': 60},
 'COMMUNICATION': {'Free': 360, 'Paid': 27},
 'DATING': {'Paid': 7, 'Free': 227},
 'EDUCATION': {'Free': 152, 'Paid': 4},
 'ENTERTAINMENT': {'Free': 147, 'Paid': 2},
 'EVENTS': {'Free': 63, 'Paid': 1},
 'FINANCE': {'Free': 349, 'Paid': 17},
 'FOOD_AND_DRINK': {'Free': 125, 'Paid': 2},
 'HEALTH_AND_FITNESS': {'Free': 325, 'Paid': 16},
 'HOUSE_AND_HOME': {'Free': 88},
 'LIBRARIES_AND_DEMO': {'Free': 84, 'Paid': 1},
 'LIFESTYLE': {'Free': 363, 'Paid': 19},
 'GAME': {'Free': 1061, 'Paid': 83},
 'FAMILY': {'Free': 1781, 'Paid': 191},
 'MEDICAL': {'Paid': 109, 'Free': 354},
 'SOCIAL': {'Free': 292, 'Paid': 3},
 'SHOPPING': {'Free': 258, 'Paid': 2},
 'PHOTOGRAPHY': {'Free': 313, 'Paid': 22},
 'SPORTS': {'Free': 360, 'Paid': 24},
 'TRAVEL_AND_LOCAL': {'Free': 246, '
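
The same per-category counts can be obtained without an explicit loop. A minimal sketch on hypothetical toy data, using `pd.crosstab` for a table and `groupby(...).size()` for a flat Series:

```python
import pandas as pd

# Hypothetical toy frame with the two columns of interest
toy = pd.DataFrame({
    "Category": ["GAME", "GAME", "FAMILY", "GAME", "FAMILY"],
    "Type": ["Free", "Paid", "Free", "Free", "Free"],
})

# One row per category, one column per app type; absent combinations become 0
counts = pd.crosstab(toy["Category"], toy["Type"])

# The same counts as a Series indexed by (Category, Type) pairs
pair_counts = toy.groupby(["Category", "Type"]).size()

print(counts)
```

Unlike the nested dictionary, the crosstab shows a 0 for categories that have no paid apps instead of omitting the key.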

#### LISTING FREE APPS WITH RATING MORE THAN 4.5 BELONGING TO FAMILY CATEGORY

In [40]:
categories

['ART_AND_DESIGN',
 'AUTO_AND_VEHICLES',
 'BEAUTY',
 'BOOKS_AND_REFERENCE',
 'BUSINESS',
 'COMICS',
 'COMMUNICATION',
 'DATING',
 'EDUCATION',
 'ENTERTAINMENT',
 'EVENTS',
 'FINANCE',
 'FOOD_AND_DRINK',
 'HEALTH_AND_FITNESS',
 'HOUSE_AND_HOME',
 'LIBRARIES_AND_DEMO',
 'LIFESTYLE',
 'GAME',
 'FAMILY',
 'MEDICAL',
 'SOCIAL',
 'SHOPPING',
 'PHOTOGRAPHY',
 'SPORTS',
 'TRAVEL_AND_LOCAL',
 'TOOLS',
 'PERSONALIZATION',
 'PRODUCTIVITY',
 'PARENTING',
 'WEATHER',
 'VIDEO_PLAYERS',
 'NEWS_AND_MAGAZINES',
 'MAPS_AND_NAVIGATION',
 '1.9']
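
The last entry, `'1.9'`, does not look like a category name, which suggests that at least one row has misaligned fields. A minimal sketch for locating such rows on hypothetical toy data, assuming that valid categories consist only of upper-case letters and underscores:

```python
import pandas as pd

# Hypothetical toy frame with one malformed category
toy = pd.DataFrame({
    "App": ["A", "B", "C"],
    "Category": ["GAME", "1.9", "ART_AND_DESIGN"],
})

# Assumption: a valid category consists only of upper-case letters and underscores
is_valid = toy["Category"].str.fullmatch(r"[A-Z_]+")

# Rows that fail the pattern are candidates for manual inspection
suspect_rows = toy[~is_valid]
print(suspect_rows)
```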

In [41]:
res = list(df_nullsimputed[(df_nullsimputed['Category'].isin(['FAMILY'])) & (df_nullsimputed['Type'].isin(['Free']))]['App'].unique())

In [42]:
print(f"There are {len(res)} free applications in the FAMILY category")

There are 1724 free applications in the FAMILY category


In [43]:
Premium_Free_Family_apps_list = list(df_nullsimputed[(df_nullsimputed['Category'].isin(['FAMILY'])) & (df_nullsimputed['Type'].isin(['Free'])) & (df_nullsimputed['Rating']>4.5)]['App'].unique())

In [44]:
print(f"There are {len(Premium_Free_Family_apps_list)} free apps in the FAMILY category rated more than 4.5")

There are 309 free apps in the FAMILY category rated more than 4.5
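
When each condition tests a single value, a plain `==` comparison reads more directly than `isin` with a one-element list. A minimal sketch of the same three-condition filter on hypothetical toy data:

```python
import pandas as pd

# Hypothetical toy frame
toy = pd.DataFrame({
    "App": ["A", "B", "C", "D"],
    "Category": ["FAMILY", "FAMILY", "GAME", "FAMILY"],
    "Type": ["Free", "Paid", "Free", "Free"],
    "Rating": [4.7, 4.9, 4.8, 4.2],
})

# Each comparison yields a boolean Series; & keeps rows where all are True
mask = (toy["Category"] == "FAMILY") & (toy["Type"] == "Free") & (toy["Rating"] > 4.5)

top_free_family = toy.loc[mask, "App"].unique().tolist()
print(top_free_family)
```

The parentheses around each comparison are required, because `&` binds more tightly than `==` and `>` in Python.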


#### FIND THE MEAN RATING OF EVERY CATEGORY IN DESCENDING ORDER
Here we make use of the **groupby** function in pandas, which works like the GROUP BY clause in SQL.

In [45]:
df_nullsimputed.groupby(by=['Category'])['Rating'].mean().sort_values(ascending=False)

Category
1.9                    19.000000
EDUCATION               4.387778
EVENTS                  4.363647
ART_AND_DESIGN          4.350462
BOOKS_AND_REFERENCE     4.311026
PERSONALIZATION         4.307603
GAME                    4.282506
PARENTING               4.282223
HEALTH_AND_FITNESS      4.266296
BEAUTY                  4.260882
SHOPPING                4.254052
SOCIAL                  4.248001
WEATHER                 4.239675
SPORTS                  4.218404
PRODUCTIVITY            4.208287
HOUSE_AND_HOME          4.196819
FAMILY                  4.192394
PHOTOGRAPHY             4.192179
AUTO_AND_VEHICLES       4.190824
MEDICAL                 4.190167
LIBRARIES_AND_DEMO      4.181962
FOOD_AND_DRINK          4.170709
COMMUNICATION           4.163842
COMICS                  4.156445
BUSINESS                4.145987
NEWS_AND_MAGAZINES      4.142993
FINANCE                 4.139108
ENTERTAINMENT           4.126174
TRAVEL_AND_LOCAL        4.119716
LIFESTYLE               4.112427
V
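
The `1.9` category at the top has a mean rating of 19.0, far above every other category. One way to keep such values out of the averages is to restrict the ratings to a plausible range first. A minimal sketch on hypothetical toy data, assuming that valid ratings lie on a 1 to 5 scale:

```python
import pandas as pd

# Hypothetical toy frame with one out-of-range rating
toy = pd.DataFrame({
    "Category": ["GAME", "GAME", "1.9", "FAMILY"],
    "Rating": [4.0, 5.0, 19.0, 4.2],
})

# Assumption: valid ratings lie on a 1 to 5 scale (between is inclusive)
valid = toy[toy["Rating"].between(1, 5)]

mean_by_category = valid.groupby("Category")["Rating"].mean().sort_values(ascending=False)
print(mean_by_category)
```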

#### FIND NUMBER OF APPS IN EACH CATEGORY

In [46]:
df_nullsimputed.groupby(by=['Category'])['Type'].count().sort_values(ascending=False)

Category
FAMILY                 1972
GAME                   1144
TOOLS                   843
MEDICAL                 463
BUSINESS                460
PRODUCTIVITY            424
PERSONALIZATION         392
COMMUNICATION           387
SPORTS                  384
LIFESTYLE               382
FINANCE                 366
HEALTH_AND_FITNESS      341
PHOTOGRAPHY             335
SOCIAL                  295
NEWS_AND_MAGAZINES      283
SHOPPING                260
TRAVEL_AND_LOCAL        258
DATING                  234
BOOKS_AND_REFERENCE     231
VIDEO_PLAYERS           175
EDUCATION               156
ENTERTAINMENT           149
MAPS_AND_NAVIGATION     137
FOOD_AND_DRINK          127
HOUSE_AND_HOME           88
LIBRARIES_AND_DEMO       85
AUTO_AND_VEHICLES        85
WEATHER                  82
ART_AND_DESIGN           65
EVENTS                   64
PARENTING                60
COMICS                   60
BEAUTY                   53
1.9                       1
Name: Type, dtype: int64
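
The count and the mean rating per category can also be computed in a single pass with named aggregation. A minimal sketch on hypothetical toy data; the output column names `apps` and `mean_rating` are chosen here for illustration:

```python
import pandas as pd

# Hypothetical toy frame
toy = pd.DataFrame({
    "Category": ["GAME", "GAME", "FAMILY", "GAME"],
    "Rating": [4.0, 5.0, 4.2, 4.5],
})

# Named aggregation: one output column per (source column, function) pair
summary = (
    toy.groupby("Category")
       .agg(apps=("Rating", "count"), mean_rating=("Rating", "mean"))
       .sort_values("apps", ascending=False)
)
print(summary)
```

Note that `count` only counts non-null values, so on a frame that still contains nulls it can differ from the number of rows in each group.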