<div class="alert alert-block alert-success">
    <h1 align="center">10 Steps to Follow When Conducting Exploratory Data Analysis (EDA)</h1>
    <h4 align="center"><a href="https://mohammadamirifard.com/">Mohammad Amirifard</a></h4>
</div>

<img src = "https://miro.medium.com/proxy/1*Owa2rsDG6Rwv1IM_RdsL3A.gif">

### Detail:
In this notebook you will learn how to perform exploratory data analysis (EDA) on your data.

EDA: Before analysing data with any method, whether manual or with computing tools, we need to check and understand the data we have. The goal is to find out whether the data is sufficient and ready for the analytical process. Data is not always clean and ready to use: much real-world data is messy and inconsistent, with missing values, duplicate values, class imbalance, and other problems. Some of you may already have dealt with these issues and know that real-world data is rarely as clean as a Kaggle dataset. For this reason, Exploratory Data Analysis (EDA) should be done beforehand.

<div class="alert alert-block alert-success">
    <h1 align="center">Let's get started</h1>
    <h2 align="center">Step1: Import libraries and load dataset</h2>
</div> 

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns

filepath = r'googleplaystore.csv'
data = pd.read_csv(filepath)
data.head()

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,"100,000+",Free,0,Everyone,Art & Design;Creativity,"June 20, 2018",1.1,4.4 and up


<div class="alert alert-block alert-success">
    <h2 align="center">Step2: Gather useful information of the data</h2>
</div> 

In [2]:
rows, columns = data.shape
data.info()
print(f'\nThe number of rows is    :{rows}\nThe number of columns is :{columns}')
print(f'\nThe title of columns : \n{data.columns}')
print(f'\nIndex list:\n{data.index}')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10841 entries, 0 to 10840
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   App             10841 non-null  object 
 1   Category        10841 non-null  object 
 2   Rating          9367 non-null   float64
 3   Reviews         10841 non-null  object 
 4   Size            10841 non-null  object 
 5   Installs        10841 non-null  object 
 6   Type            10840 non-null  object 
 7   Price           10841 non-null  object 
 8   Content Rating  10840 non-null  object 
 9   Genres          10841 non-null  object 
 10  Last Updated    10841 non-null  object 
 11  Current Ver     10833 non-null  object 
 12  Android Ver     10838 non-null  object 
dtypes: float64(1), object(12)
memory usage: 1.1+ MB

The number of rows is    :10841
The number of columns is :13

The title of columns : 
Index(['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type',
 

In [3]:
# Find unique items in each column
data.nunique()

App               9660
Category            34
Rating              40
Reviews           6002
Size               462
Installs            22
Type                 3
Price               93
Content Rating       6
Genres             120
Last Updated      1378
Current Ver       2832
Android Ver         33
dtype: int64
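
A very low unique count can hint that a column is categorical, and an unexpected extra value can hint at a data problem. As a small sketch on made-up data (the `toy` frame is hypothetical, not the Play Store file), `value_counts(dropna=False)` lists the values behind such a count:

```python
import numpy as np
import pandas as pd

# Hypothetical toy column standing in for a low-cardinality column
toy = pd.DataFrame({"Type": ["Free", "Paid", "Free", "0", np.nan, "Free"]})

# nunique() ignores NaN by default, so this counts "Free", "Paid" and "0"
n_unique = toy["Type"].nunique()

# value_counts(dropna=False) also reports how many entries are missing
counts = toy["Type"].value_counts(dropna=False)
print(n_unique)
print(counts)
```

Reading the actual values, rather than only their count, makes stray entries easier to spot.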

<div class="alert alert-block alert-success">
    <h2 align="center">Step3: Change the type of each column to a suitable type such as int, float, ...</h2>
</div> 

In [4]:
# Change the type of column "Reviews" from object to int32.
# Running data.Reviews = data.Reviews.astype('int32') directly is interrupted by an error caused by
# the item '3.0M'. To fix this, we first find which index holds this item:
for i, item in enumerate(data.Reviews):
   try:
      int(item)
   except ValueError:
      print('The index ,including error, is {} with a value of {!r}'.format(i, item))


The index ,including error, is 10472 with a value of '3.0M'
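
The same check can also be written without an explicit loop. The following is a minimal sketch on made-up values (the `reviews` series is hypothetical), using `pd.to_numeric` with `errors="coerce"` to flag entries that are not numeric:

```python
import pandas as pd

# Hypothetical toy values; "3.0M" mimics the entry reported above
reviews = pd.Series(["159", "967", "3.0M", "87510"])

# errors="coerce" turns anything that cannot be parsed as a number into NaN
parsed = pd.to_numeric(reviews, errors="coerce")

# Entries that were present originally but failed to parse
bad = reviews[parsed.isna() & reviews.notna()]
print(bad)
```

Note that this variant is slightly more permissive than `int(item)`: a string such as `"3.0"` parses as a number here, while `int("3.0")` raises a `ValueError`.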


In [5]:
# Now we must inspect this row to fix the error.
# Inspecting it shows that its values must be shifted one step to the right. So:
data.iloc[10472] = data.iloc[10472].shift(periods=1)
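
To see what `shift(periods=1)` does to a single row, here is a small sketch on a made-up row (the column subset and values are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical misaligned row: values sit one column too far to the left
row = pd.Series(
    ["Some App", "1.9", "19", "3.0M", np.nan],
    index=["App", "Category", "Rating", "Reviews", "Size"],
)

# shift(periods=1) moves each value one position to the right;
# the first position becomes NaN and the last value is dropped
shifted = row.shift(periods=1)
print(shifted)
```

The first column is left empty by the shift, which is worth keeping in mind when missing values are checked later.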

In [6]:
# Now we can change the type of it:
data.Reviews = data.Reviews.astype('int32')

In [7]:
# Change type of columns ["App","Type","Content Rating","Category","Genres", "Current Ver", "Android Ver"] from objects
# to categories
for i in ["App","Type","Content Rating","Category","Genres", "Current Ver", "Android Ver"]:
    data[i] = data[i].astype('category')
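
The benefit of the `category` type is easiest to see by comparing memory use. This is a minimal sketch on made-up data (the `genres` series is hypothetical), assuming a column with few distinct values repeated many times:

```python
import pandas as pd

# Hypothetical toy column with few distinct values repeated many times
genres = pd.Series(["Tools", "Education", "Entertainment"] * 1000)

as_object = genres.memory_usage(deep=True)
as_category = genres.astype("category").memory_usage(deep=True)

# Categories store each distinct string once plus small integer codes
print(as_object, as_category)
```

For a column in which almost every value is distinct, the saving may be small or absent, so it can be worth measuring before converting.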

In [8]:
# Change type of column "Rating" from object to float16 (to minimize memory usage)
data.Rating = data.Rating.astype('float16')
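
Storing ratings as `float16` saves memory but keeps only about three significant decimal digits. The following is a minimal sketch, using NumPy directly, of the rounding this introduces; it is also why a rating of 1.9 can be displayed as 1.900391 after conversion.

```python
import numpy as np

# float16 cannot represent 1.9 exactly and stores the nearest value it can
value = np.float16(1.9)
print(float(value))  # 1.900390625

# Two values that differ by less than the float16 spacing collapse together
same = np.float16(4.1) == np.float16(4.1001)
print(same)
```

If exact comparisons or aggregations matter, `float32` may be a safer compromise.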

In [9]:
# Change type of column "Size" from object to float. First we must remove the letters 'M' and 'k' from the items and multiply
# them by 1,000,000 and 1,000, respectively. So we define a function to pass to data.Size.apply()

# Our function:
def items_to_float_values(x):
    if isinstance(x, (int, float)):
        return x
    if 'k' in x:
        if len(x) > 1:
            return float(x.replace('k', '')) * 1000
        return 1000.0
    if 'M' in x:
        if len(x) > 1:
            return float(x.replace('M', '')) * 1000000
        return 1000000.0
    return None

data.Size = data.Size.apply(items_to_float_values)

# So far we have turned the items into float numbers and removed 'M' and 'k' from them
# Take a look at them:
print('Type of column "Size" has changed to :',data.Size.dtype)
data.iloc[:,[4]]

Type of column "Size" has changed to : float64


Unnamed: 0,Size
0,19000000.0
1,14000000.0
2,8700000.0
3,25000000.0
4,2800000.0
...,...
10836,53000000.0
10837,3600000.0
10838,9500000.0
10839,
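
The same conversion can be sketched without a row-by-row function. This is one possible vectorized variant on made-up values (the `size` series is hypothetical), assuming every valid entry is a number followed by `k` or `M`:

```python
import pandas as pd

# Hypothetical toy values in the same style as the "Size" column
size = pd.Series(["19M", "8.7M", "201k", "Varies with device"])

# Split each entry into a numeric part and a unit suffix (k or M)
parts = size.str.extract(r"^([\d.]+)([kM])$")
multiplier = parts[1].map({"k": 1_000, "M": 1_000_000})

# Entries that do not match the pattern become NaN
size_numeric = parts[0].astype(float) * multiplier
print(size_numeric)
```

Entries without a recognised suffix end up as NaN, just as the function above returns `None` for them.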


In [10]:
# Change type of column "Installs" from object to int64. First we must remove the string '+' from the items, then remove the
# commas, and finally change the type:

data= data.replace(to_replace={'Installs':'[+]'},value={'Installs':''},regex=True)
data.Installs = data.Installs.str.replace(',', '').astype('int64')
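
Both clean-up steps can also be chained on the column itself. A minimal sketch on made-up values (the `installs` series is hypothetical):

```python
import pandas as pd

# Hypothetical toy values in the same style as the "Installs" column
installs = pd.Series(["10,000+", "500,000+", "5,000,000+"])

# regex=False treats "+" and "," as plain characters, not regex syntax
cleaned = (
    installs.str.replace("+", "", regex=False)
    .str.replace(",", "", regex=False)
    .astype("int64")
)
print(cleaned.tolist())  # [10000, 500000, 5000000]
```

Keeping the replacement on the single column avoids touching the rest of the DataFrame.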

In [11]:
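# Change type of column 'Price' from object to float16. But first, we must remove '$' from the items.
# regex=False makes '$' a literal character; as a regular expression it is an end-of-string anchor.
data.Price = data.Price.str.replace('$', '', regex=False).astype('float16')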
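
The `$` character deserves some care, because in a regular expression it means "end of string" rather than a dollar sign. The sketch below uses Python's `re` module and a made-up `price` series (hypothetical values) to show the difference:

```python
import re

import pandas as pd

# As a regular expression, "$" is an end-of-string anchor, so nothing is removed
unchanged = re.sub("$", "", "$4.99")

# Escaping it matches the literal dollar sign
escaped = re.sub(r"\$", "", "$4.99")

# Hypothetical toy values in the same style as the "Price" column
price = pd.Series(["0", "$4.99", "$1.49"])
cleaned = price.str.replace("$", "", regex=False).astype("float64")
print(unchanged, escaped, cleaned.tolist())
```

Passing `regex=False` (or escaping the pattern as `r"\$"`) keeps the behaviour the same regardless of how a given pandas version treats single-character patterns.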

In [12]:
# Change type of column "Last Updated" from object to datetime
data['Last Updated'] = pd.to_datetime(data['Last Updated'])
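
When the date layout is known, it can be stated explicitly. This is a minimal sketch on made-up values (the `last_updated` series is hypothetical), assuming dates written like "January 7, 2018" and English month names:

```python
import pandas as pd

# Hypothetical toy values in the same style as the "Last Updated" column
last_updated = pd.Series(["January 7, 2018", "August 1, 2018"])

# An explicit format avoids guessing; errors="coerce" turns bad entries into NaT
parsed = pd.to_datetime(last_updated, format="%B %d, %Y", errors="coerce")
print(parsed.dt.year.tolist(), parsed.dt.month.tolist())
```

With `errors="coerce"`, entries that do not fit the format show up as missing values instead of stopping the conversion.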

In [None]:
# So far, we have successfully changed all the old types to new ones. Therefore, take a look at the data information again

In [14]:
data.info()
print('\nCongratulations. You have successfully changed the types')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10841 entries, 0 to 10840
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype         
---  ------          --------------  -----         
 0   App             10840 non-null  category      
 1   Category        10841 non-null  category      
 2   Rating          9367 non-null   float16       
 3   Reviews         10841 non-null  int32         
 4   Size            9146 non-null   float64       
 5   Installs        10841 non-null  int64         
 6   Type            10840 non-null  category      
 7   Price           10841 non-null  float16       
 8   Content Rating  10841 non-null  category      
 9   Genres          10840 non-null  category      
 10  Last Updated    10841 non-null  datetime64[ns]
 11  Current Ver     10833 non-null  category      
 12  Android Ver     10839 non-null  category      
dtypes: category(7), datetime64[ns](1), float16(2), float64(1), int32(1), int64(1)
memory usage: 862.3 KB

Cong

<div class="alert alert-block alert-success">
    <h2 align="center">Step4: Fill missing values </h2>
</div> 

In [15]:
# At the beginning, we check the number of missing values in each column
data.isnull().sum()

App                  1
Category             0
Rating            1474
Reviews              0
Size              1695
Installs             0
Type                 1
Price                0
Content Rating       0
Genres               1
Last Updated         0
Current Ver          8
Android Ver          2
dtype: int64
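
Raw counts are easier to judge relative to the number of rows. A minimal sketch on made-up data (the `toy` frame is hypothetical) that expresses missing values as a percentage per column:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with a few gaps
toy = pd.DataFrame(
    {
        "Rating": [4.1, np.nan, 4.7, np.nan],
        "Type": ["Free", "Paid", None, "Free"],
        "Reviews": [159, 967, 87510, 215644],
    }
)

# isnull().mean() gives the fraction of missing entries per column
missing_pct = (toy.isnull().mean() * 100).sort_values(ascending=False)
print(missing_pct)
```

Sorting in descending order puts the columns that need the most attention at the top.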

In [36]:
# Now we must check all the missing values and fill them if necessary
# Start with column "App". We must find where the value is missing. So:
index_of_null_value_in_App_column= np.where(data.App.isnull())[0]
data.iloc[index_of_null_value_in_App_column]

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
10472,,Life Made WI-Fi Touchscreen Photo Frame,1.900391,19,3000000.0,1000,Free,0.0,Everyone,,2018-02-11,1.0.19,4.0 and up
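
Rows with a missing value can also be selected with a boolean mask instead of collecting positions first. A minimal sketch on made-up data (the `toy` frame is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing app name
toy = pd.DataFrame({"App": ["A", np.nan, "C"], "Rating": [4.1, 1.9, 4.7]})

# The mask is True where "App" is missing and selects those rows directly
missing_rows = toy[toy["App"].isna()]
print(missing_rows)
```

Both approaches return the same rows; the mask version simply skips the intermediate index array.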


In [37]:
# Wow. There really is a NaN value in column "App"