### **ABOUT DATASET:**
- **DESCRIPTION:**\
The dataset is taken from Kaggle: https://www.kaggle.com/datasets/lava18/google-play-store-apps?resource=download

- **CONTEXT:**\
While many public datasets (on Kaggle and the like) provide Apple App Store data, few counterpart datasets are available for Google Play Store apps anywhere on the web. On digging deeper, I found that the iTunes App Store page uses a nicely indexed, appendix-like structure that allows simple and easy web scraping. The Google Play Store, on the other hand, uses sophisticated modern techniques (such as dynamic page loading) built on jQuery, which makes scraping more challenging.

- **Content:**\
Each app (row) has values for category, rating, size, and more.

- **Acknowledgements:**\
This information is scraped from the Google Play Store. This app information would not be available without it.

- **Inspiration:**\
The Play Store apps data has enormous potential to drive app-making businesses to success. Actionable insights can be drawn for developers to work on and capture the Android market!

# IMPORT LIBRARIES:

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Data loading and preprocessing (exploration and cleaning):
1. Load the dataset from a local file or web source using the pandas read_csv() function. The data is stored in a variable named 'df'.
2. Perform exploratory data analysis (EDA) to understand the structure, features, and distributions of the data, using functions such as df.head(), df.describe(), and df.info().
3. Handle missing values: use the pandas fillna() method to replace NaN values with an appropriate value.
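
The three steps above can be sketched on a tiny in-memory sample. The sample rows and the choice of the median as fill value are assumptions made for illustration only; they are not taken from the real dataset.

```python
import io
import pandas as pd

# Hypothetical three-row sample standing in for googleplaystore.csv.
sample_csv = io.StringIO(
    "App,Category,Rating\n"
    "App A,TOOLS,4.5\n"
    "App B,GAME,\n"
    "App C,TOOLS,3.5\n"
)

# Step 1: load the data with read_csv().
sample = pd.read_csv(sample_csv)

# Step 2: inspect the structure and summary statistics.
print(sample.head())
print(sample.describe())

# Step 3: fill the missing rating with the column median (one possible choice).
sample["Rating"] = sample["Rating"].fillna(sample["Rating"].median())
print(sample)
```

Which fill value is appropriate depends on the column; the median is only one reasonable option for a skewed numeric column such as a rating.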

In [None]:
df = pd.read_csv('googleplaystore.csv')

In [None]:
# reviewing the first 10 rows
df.head(10)

In [None]:
# show all rows and columns in the output
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

In [None]:
# hide warning on runtime
import warnings
warnings.filterwarnings('ignore')

In [None]:
# columns
df.columns

In [None]:
#shape of dataset
df.shape

In [None]:
df.info()

# Observations:
1. There are 10841 rows and 13 columns in the dataset.
2. The columns have different data types.
3. The dataset mixes categorical variables (some with missing values) and numerical variables.
   

In [None]:
df.describe()

In [None]:
# clean 'Size', 'Price', 'Installs' and 'Reviews'; start by checking 'Size' for nulls
df['Size'].isnull().sum()

There are no null values, so let's move forward.

In [None]:
# check unique values
df['Size'].unique()

In [None]:
# before converting MB and kB into bytes, count the values given in 'M'
df['Size'].loc[df['Size'].str.contains('M')].value_counts().sum()


In [None]:
df['Size'].loc[df['Size'].str.contains('k')].value_counts().sum()

In [None]:
df['Size'].loc[df['Size'].str.contains('Varies with device')].value_counts().sum()

In [None]:
df['Size'].value_counts().sum()

In [None]:
8830+316+1695

We have 8830 values in 'M' units.\
We have 316 values in 'k' units.\
We have 1695 values marked 'Varies with device'.
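
The same breakdown can be obtained in one pass by labelling each value with its unit and counting the labels. This is a minimal sketch on made-up values; the helper name `size_unit` is hypothetical and not part of the notebook.

```python
import pandas as pd

# Hypothetical Size values, for illustration only.
sizes = pd.Series(["19M", "201k", "Varies with device", "8.7M", "Varies with device"])

def size_unit(value):
    """Return a unit label for one raw Size string."""
    if value == "Varies with device":
        return "varies"
    if value.endswith("M"):
        return "M"
    if value.endswith("k"):
        return "k"
    return "other"

# Count how many values fall into each unit group.
unit_counts = sizes.map(size_unit).value_counts()
print(unit_counts)
```

If the group counts add up to the length of the column, every row has been accounted for, which is what the manual sum above checks.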

#### Let's convert the values to numeric

In [None]:
# Convert the Size column to numeric bytes: multiply by 1024 for 'k' values and by 1024*1024 for 'M' values.
def convert_size(size):
    """Convert a size string to bytes: 'k' values are multiplied by 1024, 'M' values by 1024*1024, and 'Varies with device' becomes NaN."""
    if isinstance(size, str):
        if 'k' in size:
            return float(size.replace('k', '')) * 1024
        elif 'M' in size:
            return float(size.replace('M', '')) * 1024 * 1024
        elif 'Varies with device' in size:
            # the raw value is 'Varies with device' (singular), so match that exact text
            return np.nan
    return size
df['Size'] = df['Size'].apply(convert_size)

In [None]:
# rename the column 'Size' to 'Size_in_bytes'
df.rename(columns={'Size': 'Size_in_bytes'}, inplace=True)
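
As an alternative to the row-by-row function, the conversion can be sketched with vectorized string methods. This is an assumption-laden illustration on made-up values, not the notebook's own method: it assumes every convertible value is a number followed by a single 'k' or 'M'.

```python
import pandas as pd

# Hypothetical raw Size values, for illustration only.
raw = pd.Series(["19M", "201k", "Varies with device", "1.5M"])

# Split each value into a numeric part and a unit suffix ('k' or 'M').
parts = raw.str.extract(r"^(?P<number>[\d.]+)(?P<unit>[kM])$")

# Map each unit to its multiplier in bytes.
multiplier = parts["unit"].map({"k": 1024, "M": 1024 * 1024})

# Rows that did not match the pattern (e.g. 'Varies with device') stay NaN.
size_in_bytes = pd.to_numeric(parts["number"]) * multiplier
print(size_in_bytes)
```

Any value that does not fit the pattern ends up as NaN, so unexpected entries show up as missing values instead of staying in the column as strings.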

Let's have a look at the Installs column.

In [None]:
# unique values in the Installs column
df['Installs'].unique()

In [None]:
# let's look at the value counts
df['Installs'].value_counts()

In [None]:
# count the values that contain '+' (regex=False matches the literal character)
df['Installs'].loc[df['Installs'].str.contains('+', regex=False)].value_counts().sum()

In [None]:
# total number of values in the Installs column
df['Installs'].value_counts().sum()

In [None]:
# remove '+' from the values
df['Installs'] = df['Installs'].apply(lambda x: x.replace('+', '') if '+' in str(x) else x)

In [None]:
df['Installs'].dtypes

In [None]:
df['Installs'].value_counts()

In [None]:
# remove ',' from the values
df['Installs'] = df['Installs'].apply(lambda x: x.replace(',', '') if ',' in str(x) else x)

In [None]:
df.head()

In [None]:
df['Installs']=df['Installs'].apply(lambda x: int(x))
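
The Installs clean-up can also be sketched as a single vectorized pass. The values below are made up, and 'Free' is a hypothetical stand-in for any entry that is not a number; with `errors='coerce'` such an entry becomes NaN instead of raising an exception.

```python
import pandas as pd

# Hypothetical Installs values; 'Free' stands in for any non-numeric entry.
installs = pd.Series(["10,000+", "500+", "1,000,000+", "0", "Free"])

# Strip '+' and ',' in one pass using a regex character class.
cleaned = installs.str.replace(r"[+,]", "", regex=True)

# errors='coerce' turns anything non-numeric into NaN instead of raising.
numeric = pd.to_numeric(cleaned, errors="coerce")
print(numeric)
```

Counting the NaN values afterwards gives a quick check for rows that need manual inspection.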

In [None]:
# making a new column called 'Installs_category' which will have the category of the installs
bins = [-1, 0, 10, 1000, 10000, 100000, 1000000, 10000000, 10000000000]
labels=['no', 'Very low', 'Low', 'Moderate', 'More than moderate', 'High', 'Very High', 'Top Notch']
df['Installs_category'] = pd.cut(df['Installs'], bins=bins, labels=labels)

In [None]:
df['Installs_category'].value_counts()
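
To see how the bin edges are treated, the sketch below applies the same bins and labels to a few made-up install counts that sit on or next to an edge. By default `pd.cut` uses right-closed intervals, so a value equal to an edge falls into the lower category.

```python
import pandas as pd

bins = [-1, 0, 10, 1000, 10000, 100000, 1000000, 10000000, 10000000000]
labels = ['no', 'Very low', 'Low', 'Moderate', 'More than moderate',
          'High', 'Very High', 'Top Notch']

# Hypothetical install counts chosen to sit on or near bin edges.
example = pd.Series([0, 10, 11, 1000, 1001, 10000000, 10000001])

# Intervals are (-1, 0], (0, 10], (10, 1000], ... with the right edge included.
categories = pd.cut(example, bins=bins, labels=labels)
print(pd.DataFrame({"installs": example, "category": categories}))
```

Starting the bins at -1 is what lets apps with 0 installs land in the 'no' category, since the left edge of each interval is excluded.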

Let's have a look at the Price column.

In [None]:


df['Price'].isnull().sum()

In [None]:
df['Price'].value_counts()

In [None]:
df['Price'].unique()

In [None]:
# count the values that contain a '$' sign (raw string avoids an invalid-escape warning)
df['Price'].loc[df['Price'].str.contains(r'\$')].value_counts().sum()

In [None]:
# count the values that contain '0' but no '$' sign (the free apps)
df['Price'].loc[(df['Price'].str.contains('0')) & (~df['Price'].str.contains(r'\$'))].value_counts().sum()

In [None]:
# remove the '$' sign
df['Price'] = df['Price'].apply(lambda x: x.replace('$', '') if '$' in str(x) else x)


In [None]:
df['Price']=df['Price'].apply(lambda x: float(x))

In [None]:
df['Price'].dtype

In [None]:
df['Price'].value_counts()
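
The two counts taken before the conversion amount to a sanity check: every price should either carry a '$' sign or be the free value '0'. A minimal sketch of that check and the conversion, on made-up prices, could look like this:

```python
import pandas as pd

# Hypothetical Price values, for illustration only.
prices = pd.Series(["0", "$4.99", "0", "$0.99", "$399.99"])

# regex=False matches the literal '$' instead of the end-of-string anchor.
has_dollar = prices.str.contains("$", regex=False)
is_free = prices == "0"

# Every row should be either paid ('$...') or free ('0').
all_accounted_for = bool((has_dollar | is_free).all())
print(all_accounted_for)

# Drop the '$' sign and convert to float.
price_float = prices.str.replace("$", "", regex=False).astype(float)
print(price_float)
```

If the check is False, some value has an unexpected format and the float conversion would likely fail on it.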