# **Cleaning Data**

### Importing packages

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
!pip install opendatasets





In [3]:
import opendatasets as od

In [None]:
od.download("https://www.kaggle.com/gauthamp10/google-playstore-apps")


In [None]:
!dir

In [None]:
data = pd.read_csv('google-playstore-apps/Google-Playstore.csv')

In [None]:
data.shape

In [None]:
data.head(4)

In [None]:
data['Category'].value_counts()

In [None]:
len(data['Category'].value_counts())

In [None]:
data['Currency'].value_counts()

In [None]:
# Keep only apps priced in USD, then drop the now-constant column
data = data[data['Currency'] == 'USD']
data = data.drop(columns=['Currency'])
data = data.reset_index(drop=True)

Keeping only the rows whose **Currency** is **USD**, then dropping the **Currency** column
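
The same filter, drop, and re-index sequence can be tried on a small frame first. This is a minimal sketch; the `toy` frame and its values are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy data standing in for the Play Store table
toy = pd.DataFrame({
    "App": ["a", "b", "c"],
    "Currency": ["USD", "EUR", "USD"],
    "Price": [0.0, 1.5, 2.99],
})

# Keep only USD rows, drop the now-constant column, and renumber the index
usd_only = (
    toy[toy["Currency"] == "USD"]
    .drop(columns=["Currency"])
    .reset_index(drop=True)
)
print(usd_only)
```

Chaining the three steps avoids modifying a filtered slice in place, which is where pandas tends to emit a `SettingWithCopyWarning`.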

In [None]:
data.shape

### Dropping unnecessary columns

In [None]:
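# Pass the labels via columns=; the positional axis argument is no longer accepted by recent pandas
data.drop(columns=['App Name','App Id','Developer Website','Developer Email','Privacy Policy','Scraped Time','Installs','Editors Choice','Rating','Rating Count'],inplace=True)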
data.head()

- Analyse **Category**
- Change the data type of **Minimum Installs** and **Maximum Installs**
- Combine **Rating** and **Rating Count** as **Sum of Ratings**
- One-hot encode the column **Free**
- Remove **M** from the **Size** column and convert it to float64
- Remove the text in **Minimum Android**, convert it to float64, and rename it
- Check the data types of **Released** and **Last Updated**
- Analyse the column **Content Rating**
- One-hot encode the columns **Ad Supported**, **In App Purchases**, and **Editors Choice**


In [None]:
data.columns

In [None]:
data['Content Rating'].value_counts()

**One-hot encoding** can be applied to this column
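
As a rough illustration of what that encoding would look like, here is a sketch using `pd.get_dummies`. The labels in `toy` and the `Content` prefix are assumptions chosen for the example; the notebook itself does not perform this step here.

```python
import pandas as pd

# Hypothetical sample of content-rating labels
toy = pd.DataFrame({"Content Rating": ["Everyone", "Teen", "Everyone", "Mature 17+"]})

# One indicator column per distinct label
encoded = pd.get_dummies(toy, columns=["Content Rating"], prefix="Content")
print(encoded.columns.tolist())
```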

In [None]:
data['Size'] = data['Size'].str.replace('M','')
data['Size'] = data['Size'].str.replace(',','')
data.drop(data[data['Size'] == 'Varies with device'].index, inplace = True)
data.drop(data[data['Size'].str.contains("k") == True].index, inplace = True)

Stripping the **M** suffix and thousands separators from **Size**, and dropping the rows that cannot be expressed in MB here (`Varies with device` and kilobyte sizes), so that the column can later be cast to **float**

In [None]:
index_lst_G = data[data['Size'].str.contains("G") == True].index.to_list()
print(len(index_lst_G))

Checking the number of rows with GB-sized apps.
- As there are only a few, it is cheap to convert them to MB one row at a time

In [None]:
for i in index_lst_G:
    data.loc[i,'Size'] = data.loc[i,'Size'].replace('G','')
    val = float(data.loc[i,'Size'])*1000
    data.loc[i,'Size'] = str(val)

In [None]:
data['Size'] = data['Size'].astype(float)

Converting the data type of **Size**
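
The whole **Size** clean-up can also be written without a row-by-row loop. The following is a sketch under the same convention as the loop above (1 GB treated as 1000 MB); the `sizes` series is invented sample data, and the regular expression is one possible way to split number and unit.

```python
import pandas as pd

# Hypothetical raw size strings in the formats handled above
sizes = pd.Series(["10M", "1,024M", "1.5G", "Varies with device", "512k"])

# Split each value into a number and a unit letter; unmatched rows become NaN
parts = sizes.str.replace(",", "", regex=False).str.extract(r"^([\d.]+)([MG])$")
number = pd.to_numeric(parts[0])

# Scale GB values to MB, then discard the rows that did not match
size_mb = number.mask(parts[1] == "G", number * 1000).dropna()
print(size_mb.tolist())
```

Rows such as `Varies with device` and kilobyte sizes fall out as NaN and are removed by `dropna()`, mirroring the explicit drops in the cells above.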

In [None]:
data['Minimum Android'] = data['Minimum Android'].str.replace(' and up','')
data['Minimum Android']

Replacing unnecessary strings with an empty string

In [None]:
# str.count('.') treats '.' as a regex wildcard, so it counted characters; str.len() states that directly
three_part = data['Minimum Android'].str.len() == 5
data.loc[three_part, 'Minimum Android'] = data.loc[three_part, 'Minimum Android'].str[:-2]

Trimming three-part versions (e.g. `4.0.3`) down to two parts (`4.0`)
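
An alternative that does not depend on string length is to extract the leading `major.minor` pair with a regular expression. This is only a sketch; the `versions` series is made-up sample data in the shape the column has once ` and up` is stripped.

```python
import pandas as pd

# Hypothetical version strings after ' and up' has been stripped
versions = pd.Series(["4.0.3", "5.0", "2.3.3", "8.0"])

# Keep the leading major.minor pair and discard any patch component
major_minor = versions.str.extract(r"^(\d+\.\d+)", expand=False).astype(float)
print(major_minor.tolist())
```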

In [None]:
data['Minimum Android'] = data['Minimum Android'].str.replace('W','')
print(data.shape)
data.drop(data[data['Minimum Android'] == 'Varies with device'].index, inplace = True)
data.shape

Removing the **W** suffix and dropping the rows marked `Varies with device`

In [None]:
# str.len() makes explicit what str.count('.') computed: the number of characters in each value
data[data['Minimum Android'].str.len() > 5]['Minimum Android']

In [None]:
data.drop(data[data['Minimum Android'].str.len() > 5].index, inplace = True)

Because only a few rows have complex version strings, we **drop** them

In [None]:
data

In [None]:
data['Minimum Android'] = data['Minimum Android'].astype(float)

In [None]:
data['Minimum Installs'] = data['Minimum Installs'].astype(int)

In [None]:
data['Released'] = pd.to_datetime(data['Released'])
data['Last Updated'] = pd.to_datetime(data['Last Updated'])

Converting the data types
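
If any of these columns still contains unparseable entries, a plain `astype` call raises. A more tolerant variant is sketched below; the sample values and the `%b %d, %Y` date format are assumptions for illustration and should be checked against the real columns.

```python
import pandas as pd

# Hypothetical raw columns containing one unparseable entry each
toy = pd.DataFrame({
    "Minimum Installs": ["1000", "50", "n/a"],
    "Released": ["Feb 26, 2020", "May 21, 2020", "unknown"],
})

# errors="coerce" turns unparseable entries into NaN / NaT instead of raising
installs = pd.to_numeric(toy["Minimum Installs"], errors="coerce")
released = pd.to_datetime(toy["Released"], format="%b %d, %Y", errors="coerce")

print(installs.isna().sum(), released.isna().sum())
```

The coerced NaN and NaT values are then picked up by the null check and `dropna` that follow.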

In [None]:
data.dtypes

In [None]:
data.head()

In [None]:
data.isna().sum()

Finally, checking for and dropping null values

In [None]:
data.shape

In [None]:
data.dropna(axis = 0, inplace = True)
data.shape

In [None]:
for i in ["Category","Content Rating","Developer Id"]:
    print(data[data[i] == ''].index.to_list())

In [None]:
for i in ["Size","Minimum Android"]:
    print(data[data[i] == 0.0].index.to_list())

Checking for **hidden null values**: empty strings in the text columns and zeros in the numeric ones
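
The check can be extended to whitespace-only strings, which an equality test against `''` would miss. A minimal sketch on invented data:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with blank and whitespace-only strings hiding as values
toy = pd.DataFrame({
    "Category": ["Tools", "", "  ", "Games"],
    "Size": [10.0, 0.0, 3.5, 7.2],
})

# Treat empty or whitespace-only strings as missing
blank = toy["Category"].str.strip() == ""
toy.loc[blank, "Category"] = np.nan

# Rows with a zero size, which may also encode "unknown"
zero_size = toy.index[toy["Size"] == 0.0].tolist()

print(toy["Category"].isna().sum(), zero_size)
```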

In [None]:
data

In [None]:
data.rename(columns = {'Rating Count':'Rating_Count',
                       'Minimum Installs':'Min_Downloads',
                       'Maximum Installs':'Max_Downloads',
                       'Minimum Android':'Min_Version',
                       'Last Updated':'Last_Updated',
                       'Content Rating':'Content',
                       'Ad Supported':'Ad_Support',
                       'In App Purchases':'Inapp_Purchase',
                       'Developer Id':'DeveloperId'}, inplace = True)

In [None]:
from datetime import date
today = date.today()
today = pd.to_datetime(today)
today

In [None]:
# np.timedelta64(1, 'M') is ambiguous and rejected by recent pandas; use the mean Gregorian month length instead
month = pd.Timedelta(days=30.436875)
data['Released_Before'] = ((today - data.Released) / month).astype(int)
data['Updated_Before'] = ((today - data.Last_Updated) / month).astype(int)

In [None]:
data

In [None]:
data.drop(columns=['Released','Last_Updated'],inplace=True)
data

In [None]:
data = data[['Category', 'Free', 'Price',
       'Size', 'Min_Version', 'DeveloperId', 'Released_Before', 'Updated_Before',
       'Content', 'Ad_Support', 'Inapp_Purchase' , 'Min_Downloads' ,'Max_Downloads']]
data

In [None]:
data.columns

Renaming and Reordering **Column Names**

#### Making a meaningful **feature** from ```Min_Downloads``` and ```Max_Downloads``` to make it continuous

In [None]:
data[data["Max_Downloads"] == 0].shape

Checking for zeros in ```Max_Downloads```

In [None]:
print(data.shape)
data.drop(data[data["Max_Downloads"] == 0].index,inplace=True)
data.reset_index(drop=True,inplace=True)
data.shape

We drop these rows because an app whose ```Max_Downloads``` is ```0``` carries no download information, so the average computed from it would be a meaningless ```0```
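
The continuous feature built in this section is the midpoint between the two install bounds. A small sketch on made-up numbers shows the zero filter and the midpoint together; `Min_Downloads + (Max_Downloads - Min_Downloads) / 2` and `(Min_Downloads + Max_Downloads) / 2` are the same quantity.

```python
import pandas as pd

# Hypothetical install bounds
toy = pd.DataFrame({"Min_Downloads": [10, 1000, 0], "Max_Downloads": [15, 1999, 0]})

# Drop rows whose upper bound is zero, then take the midpoint of the interval
toy = toy[toy["Max_Downloads"] != 0].reset_index(drop=True)
toy["Average_Downloads"] = (toy["Min_Downloads"] + toy["Max_Downloads"]) / 2
print(toy["Average_Downloads"].tolist())
```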

In [None]:
data['Average_Downloads'] = data['Min_Downloads'] +((data['Max_Downloads'] - data['Min_Downloads'])/2)
data.drop(columns=['Max_Downloads','DeveloperId'],inplace=True)

In [None]:
data.reset_index(drop=True,inplace=True)
data

#### Converting the newly created column to **```integer```**

In [None]:
data["Average_Downloads"] = data['Average_Downloads'].astype(np.int64)
data.head()

In [None]:
data.describe(include = [np.number])

In [None]:
data.describe(include = ['O'])

In [None]:
data.to_csv('Playstore-data.csv.gz',compression='gzip')