Until now, we have been using our own data, which is why it was always neat. But in data mining or data analysis, before you can get a **result from data**, you usually have to **pre-process** it first. For example, say you want to analyze a Twitter hashtag: a single mistyped word can break your algorithm. Pre-processing the data makes it **suitable for your algorithm**.

In [1]:
import numpy as np
import pandas as pd

In [4]:
arr = np.array([[10, 20, np.nan],[5, np.nan, np.nan],[21, np.nan, 10]]) # "np.nan" represents a missing or corrupted value

In [5]:
arr

array([[10., 20., nan],
       [ 5., nan, nan],
       [21., nan, 10.]])

In [6]:
df = pd.DataFrame(arr, index = ["Index1", "Index2", "Index3"], columns = ["Column1", "Column2", "Column3"] )

In [8]:
df # We created a dataframe that includes NaN values. What if we want to delete them?

Unnamed: 0,Column1,Column2,Column3
Index1,10.0,20.0,
Index2,5.0,,
Index3,21.0,,10.0


In [11]:
df.dropna() # ".dropna()" deletes indexes (rows) that include NaN values. With the default axis value of 0 it checks indexes
# For example, every index in this dataframe includes a NaN value, so running this statement deletes everything.

Unnamed: 0,Column1,Column2,Column3


In [12]:
df.dropna(axis = 1) # When we give axis value as 1, it checks columns instead and deletes every column that includes a NaN value

Unnamed: 0,Column1
Index1,10.0
Index2,5.0
Index3,21.0


### We can change the condition for deleting Indexes or Columns:
   • Let's say we want to keep the indexes that include at least 2 normal values\
   • To do this, we will use the **"thresh"** parameter of **".dropna()"**

In [15]:
df.dropna(thresh = 2) # "Index1" and "Index3" include at least 2 normal values

Unnamed: 0,Column1,Column2,Column3
Index1,10.0,20.0,
Index3,21.0,,10.0
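
"thresh" also works on columns when combined with "axis = 1". The sketch below rebuilds the same dataframe so it can run on its own; the names `kept_columns` and `kept_rows` are just illustrative. It also shows `how = "all"`, which only drops an index when every value in it is NaN.

```python
import numpy as np
import pandas as pd

# Rebuild the same dataframe used in this section
df = pd.DataFrame(
    [[10, 20, np.nan], [5, np.nan, np.nan], [21, np.nan, 10]],
    index=["Index1", "Index2", "Index3"],
    columns=["Column1", "Column2", "Column3"],
)

# Keep only the columns that include at least 2 normal (non-NaN) values
kept_columns = df.dropna(axis=1, thresh=2)
print(kept_columns)

# Drop an index only if ALL of its values are NaN (none are, so nothing is dropped)
kept_rows = df.dropna(how="all")
print(kept_rows)
```

Here only "Column1" has at least 2 normal values, so it is the only column that survives the first call.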


### Deleting NaN values is not always a good idea.
• You may want to replace them with "0", "1", the mean of that index, etc.\
• To do this, we will use another function instead of **.dropna()**

In [17]:
df.fillna(value = 1) # As you can see, all NaN values become 1 (a number, not the string "1", so the columns stay numeric)

Unnamed: 0,Column1,Column2,Column3
Index1,10.0,20.0,1.0
Index2,5.0,1.0,1.0
Index3,21.0,1.0,10.0
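
Filling with a mean instead of a constant could look like the sketch below. It is only one possible approach: passing a Series to ".fillna()" fills each column with its own value, and the per-index version uses ".apply()" with "axis = 1". The names `filled_by_column` and `filled_by_index` are made up for this example.

```python
import numpy as np
import pandas as pd

# Rebuild the same dataframe used in this section
df = pd.DataFrame(
    [[10, 20, np.nan], [5, np.nan, np.nan], [21, np.nan, 10]],
    index=["Index1", "Index2", "Index3"],
    columns=["Column1", "Column2", "Column3"],
)

# Mean of each column (NaN values are skipped by default)
column_means = df.mean()

# Each NaN is replaced by the mean of the column it sits in
filled_by_column = df.fillna(column_means)
print(filled_by_column)

# Each NaN is replaced by the mean of the index (row) it sits in
filled_by_index = df.apply(lambda row: row.fillna(row.mean()), axis=1)
print(filled_by_index)
```

Neither call changes `df` itself. Both return a new dataframe.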


#### Let's say we want to replace NaN values with the mean of all numbers:

In [21]:
df.sum() # Shows the sum of the values in each column (NaN values are skipped)

Column1    36.0
Column2    20.0
Column3    10.0
dtype: float64

In [22]:
df.sum().sum() # Sums up all values

66.0

In [26]:
df.size # Shows how many values the dataframe has, including NaN values

9

In [28]:
df.isna() # Shows NaN values as True, normal values as False

Unnamed: 0,Column1,Column2,Column3
Index1,False,False,True
Index2,False,True,True
Index3,False,True,False


In [36]:
df.isnull().sum() # Checks every column and counts the number of NaN values

Column1    0
Column2    2
Column3    2
dtype: int64

In [38]:
df.isnull().sum().sum() # Gives the total number of NaN values as a single integer

4
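
The same counting can be done per index instead of per column by passing "axis = 1" to ".sum()". A small sketch, rebuilding the dataframe so it runs on its own (the variable names are only for illustration):

```python
import numpy as np
import pandas as pd

# Rebuild the same dataframe used in this section
df = pd.DataFrame(
    [[10, 20, np.nan], [5, np.nan, np.nan], [21, np.nan, 10]],
    index=["Index1", "Index2", "Index3"],
    columns=["Column1", "Column2", "Column3"],
)

# Number of NaN values in each index (row)
nan_per_index = df.isna().sum(axis=1)
print(nan_per_index)

# Number of normal (non-NaN) values in the whole dataframe
normal_count = df.notna().sum().sum()
print(normal_count)
```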

In [39]:
df.isnull() # ".isnull()" is the same as ".isna()"

Unnamed: 0,Column1,Column2,Column3
Index1,False,False,True
Index2,False,True,True
Index3,False,True,False


In [44]:
def calculateMean(df):
    totalSum = df.sum().sum() # Sum of all normal (non-NaN) values
    totalNum = df.size - df.isna().sum().sum() # Number of normal (non-NaN) values
    return totalSum / totalNum

In [45]:
df.fillna(value = calculateMean(df))

Unnamed: 0,Column1,Column2,Column3
Index1,10.0,20.0,13.2
Index2,5.0,13.2,13.2
Index3,21.0,13.2,10.0


• As you can see all NaN values became **13.2**
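
As a cross-check, NumPy's "np.nanmean" computes a mean that ignores NaN values, so it should give the same number as the manual calculation (66 / 5). A minimal sketch, with the dataframe rebuilt so it runs on its own:

```python
import numpy as np
import pandas as pd

# Rebuild the same dataframe used in this section
df = pd.DataFrame(
    [[10, 20, np.nan], [5, np.nan, np.nan], [21, np.nan, 10]],
    index=["Index1", "Index2", "Index3"],
    columns=["Column1", "Column2", "Column3"],
)

# Manual version: sum of normal values divided by their count
manual_mean = df.sum().sum() / (df.size - df.isna().sum().sum())

# NumPy version: mean over all values, ignoring NaN
numpy_mean = np.nanmean(df.to_numpy())

print(manual_mean, numpy_mean)
```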

In [47]:
df # The dataframe was not updated because "inplace" is False by default. Let's change that

Unnamed: 0,Column1,Column2,Column3
Index1,10.0,20.0,
Index2,5.0,,
Index3,21.0,,10.0


In [48]:
df.fillna(value = calculateMean(df), inplace = True)

In [50]:
df # Now it's been updated. Nice!

Unnamed: 0,Column1,Column2,Column3
Index1,10.0,20.0,13.2
Index2,5.0,13.2,13.2
Index3,21.0,13.2,10.0
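
Another common way to keep the result, instead of "inplace = True", is to assign the returned dataframe back to the same name. A short sketch of that alternative (the name `overall_mean` is only for this example):

```python
import numpy as np
import pandas as pd

# Rebuild the original dataframe, NaN values included
df = pd.DataFrame(
    [[10, 20, np.nan], [5, np.nan, np.nan], [21, np.nan, 10]],
    index=["Index1", "Index2", "Index3"],
    columns=["Column1", "Column2", "Column3"],
)

# Mean of all normal values, computed before filling
overall_mean = df.sum().sum() / df.notna().sum().sum()

# ".fillna()" returns a new dataframe, so we store it back in "df"
df = df.fillna(value=overall_mean)
print(df)
```

Both styles end with the same dataframe. Reassigning just makes the update visible in the line itself.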
