<center> Business Analytics & Business Intelligence  <br>
Deepak KC, deepak.kc@hamk.fi, 2020 <br> Data Preprocessing  </center>

# Steps for Data Preprocessing 

- The first important task in any data analysis is an exploration of the data.
- This chapter focuses on data cleaning and other important steps that should be considered when working with a new data set. 

## Importing Required Libraries

In [None]:
# Step 1: importing required modules
import pandas as pd
import numpy as np

## Import the data set

In [None]:
#Note: to check the current working directory, please type pwd
%pwd
#you may either write the complete path for the data set or save it in the same directory where your notebook file is
# to change your working directory (if needed) use cd 
%cd Data_Cleaning

# reading your data
data = pd.read_csv("covid_finland_cleaning.csv")


In [None]:
# Preview the first 5 lines of the loaded data
data.head()

In [None]:
# Let's look at a random sample of 5 rows from the data
sample=data.sample(5)
sample

In [None]:
# get to know your data set; run the following lines in separate cells
data.columns 
data.index
data.shape #(rows, columns)
data.values
data.info()
# to get only a specific column
data['geoId']
# to create a subset with the required columns
data[['geoId','year']]

# to access a row by its index label
data.loc[3]

# to access multiple rows by their index labels
data.loc[[3,4, 7]]

# to access rows that meet a condition, such as a specific year or a specific country
data.loc[(data['year'] == 2020)]

data.loc[(data['deaths'] >= 1)]

## Explore the variables

- You need to know how many variables are there in a data set, the data types of the variables and the range of values they take on. 

In [None]:
data.dtypes
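
To see the range of values each variable takes, `describe()` and `nunique()` can complement `dtypes`. The sketch below uses a small made-up frame; `toy` and its values are hypothetical stand-ins, not the course data set.

```python
import pandas as pd

# Hypothetical stand-in for the course data set
toy = pd.DataFrame({
    "geoId": ["FI", "FI", "SE", "SE"],
    "year": [2020, 2020, 2020, 2020],
    "cases": [10, 25, 40, 5],
})

# Data type of each column
print(toy.dtypes)

# Count, mean, min, max and quartiles of the numeric columns
summary = toy.describe()
print(summary)

# Number of distinct values per column, useful for categorical variables
distinct = toy.nunique()
print(distinct)
```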

# Let us check if our dataset contains missing values. 

When we looked through random subsets of the data, we already saw that our dataset has some missing values, which are marked as N/A. 
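
As a minimal sketch of how such markers are handled on import: `read_csv` treats `N/A` as missing by default, and other markers can be supplied through the `na_values` parameter. The CSV text and the `?` marker below are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical CSV text; "N/A" and "?" both stand for a missing value
raw = io.StringIO(
    "geoId,cases,deaths\n"
    "FI,10,0\n"
    "FI,N/A,1\n"
    "FI,?,2\n"
)

# "N/A" is in pandas' default list of missing-value markers;
# "?" is not, so it is passed explicitly through na_values
toy = pd.read_csv(raw, na_values=["?"])

print(toy)
print(toy["cases"].isnull().sum())
```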


In [None]:
# let's find out how many values are missing per column
missing_count = data.isnull().sum()
print(missing_count)

In [None]:
# We can see that there are 7 missing values in cases
# When working with a huge dataset it may be helpful to see what
# percentage of the data is missing 

total_data=np.prod(data.shape)  # total number of cells (rows * columns)
total_missing=missing_count.sum()
per_of_missing_data=(total_missing/total_data)* 100
print(per_of_missing_data)

# Conclusion: as you can see, we have a very small percentage
# of missing data; we may delete those rows or fill them in 
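
The overall percentage can hide a column that is mostly empty, so a per-column view may also help. A minimal sketch, assuming a small hypothetical frame `toy` rather than the course data:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a few missing values
toy = pd.DataFrame({
    "cases": [10, np.nan, 30, np.nan],
    "deaths": [0, 1, np.nan, 2],
    "geoId": ["FI", "FI", "FI", "FI"],
})

# isnull() gives True/False; the mean of booleans is the share of True values
missing_pct = toy.isnull().mean() * 100
print(missing_pct)
```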


# Handling Null Values: 

## Dropna: the dropna() method drops rows or columns with null values in different ways

If you have enough samples in the data set, you may delete the rows or columns that contain null values. This has to be done cautiously, as deleting data leads to a loss of information and may make the results less reliable.

Drop the columns where all elements are NaN:

>> data.dropna(axis=1, how='all')

Drop the columns where any of the elements is NaN:

>> data.dropna(axis=1, how='any')

Drop the rows where all of the elements are NaN:

>> data.dropna(axis=0, how='all')

Keep only the rows with at least 2 non-NaN values:

>> data.dropna(thresh=2)

Count the null values in the column cases:

>> data["cases"].isnull().values.sum()

Change the missing values in cases from NaN to 0:

>> data["cases"] = data["cases"].fillna(0)

Check the distinct values in the column cases:

>> data["cases"].unique()

Remove all rows that contain a missing value:

>> data.dropna()

Read More on dropna: https://pandas.pydata.org/pandas-docs/version/0.21.1/generated/pandas.DataFrame.dropna.html
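
A small runnable sketch of two of the options, `subset` and `thresh`, on a hypothetical frame (`toy` is made up; it is not the course data). Note that `dropna()` returns a new data frame and leaves the original untouched unless the result is assigned back.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with missing values in two columns
toy = pd.DataFrame({
    "cases": [10, np.nan, 30, np.nan],
    "deaths": [0, 1, np.nan, np.nan],
    "geoId": ["FI", "FI", "FI", "FI"],
})

# Drop rows only when "cases" is missing; other columns are ignored
by_subset = toy.dropna(subset=["cases"])

# Keep rows that have at least 2 non-missing values
by_thresh = toy.dropna(thresh=2)

print(by_subset)
print(by_thresh)
print(len(toy))  # toy itself still has all of its rows
```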

In [None]:
data["geoId"].unique()

In [None]:
# let's create a new data frame by dropping every row that contains a null value
data1=data.dropna(axis=0, how='any')

# Fill NA/NAN : Filling in missing values automatically

The pandas fillna() method may be used to fill in missing values in a dataframe. We have to specify what we want to do with the NaN values. We may replace them with 0, or replace each one with the value that comes directly after it in the same column and fill the remaining ones with 0s.



```


Replace all NaN elements with 0s
>> data.fillna(0) 

Replace all NaN elements in columns 'A', 'B', 'C', and 'D' with 0, 1, 2, and 3 respectively.
>> values = {'A': 0, 'B': 1, 'C': 2, 'D': 3}
>> data.fillna(value=values)

Only replace the first NaN element in each column
>> data.fillna(value=values, limit=1)

Replace each NaN with the value that comes directly after it in the same column, 
then replace all the remaining NaNs with 0
(older pandas versions wrote this as fillna(method='bfill'), which is now deprecated)
>> data.bfill(axis=0).fillna(0)
```
Read More at: https://pandas.pydata.org/pandas-docs/version/0.21.1/generated/pandas.DataFrame.fillna.html#pandas.DataFrame.fillna

In [None]:
#fill na/nan examples
df2 = pd.DataFrame([[np.nan, 2, np.nan, 0],
                    [3, 4, np.nan, 1],
                    [np.nan, np.nan, np.nan, 5],
                    [np.nan, 3, np.nan, 4]],
                   columns=list('ABCD'))
df2

Unnamed: 0,A,B,C,D
0,,2.0,,0
1,3.0,4.0,,1
2,,,,5
3,,3.0,,4


In [None]:
# Replace all NaN elements with 0s
df3 = df2.fillna(0)
df3

Unnamed: 0,A,B,C,D
0,0.0,2.0,0.0,0
1,3.0,4.0,0.0,1
2,0.0,0.0,0.0,5
3,0.0,3.0,0.0,4


In [None]:
# Replace all NaN elements in column ‘A’, ‘B’, ‘C’, and ‘D’, with 0, 1, 2, and 3 respectively.
values = {'A': 0, 'B': 1, 'C': 2, 'D': 3}
df4 = df2.fillna(value=values)
df4

Unnamed: 0,A,B,C,D
0,0.0,2.0,2.0,0
1,3.0,4.0,2.0,1
2,0.0,1.0,2.0,5
3,0.0,3.0,2.0,4


In [None]:
# Only replace the first NaN element in each column
df5 = df2.fillna(value=values, limit=1)
df5

Unnamed: 0,A,B,C,D
0,0.0,2.0,2.0,0
1,3.0,4.0,,1
2,,1.0,,5
3,,3.0,,4


In [None]:
# Replace each NaN with the value that comes directly after it in the same column,
# then replace all the remaining NaNs with 0
# (bfill() is used because fillna(method='bfill') is deprecated in recent pandas versions)

df6 = df2.bfill(axis=0).fillna(0)
df6

Unnamed: 0,A,B,C,D
0,3.0,2.0,0.0,0
1,3.0,4.0,0.0,1
2,0.0,3.0,0.0,5
3,0.0,3.0,0.0,4
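
Besides a constant or a neighbouring value, a column statistic is sometimes used as the fill value. The sketch below fills each column with its own mean; this is one possible strategy shown on a hypothetical frame `toy`, not a recommendation for the course data.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric frame with one gap per column
toy = pd.DataFrame({
    "A": [1.0, np.nan, 3.0],
    "B": [np.nan, 4.0, 8.0],
})

# toy.mean() is a Series indexed by column name, so each column's
# NaN values are filled with that column's own mean
filled = toy.fillna(toy.mean())
print(filled)
```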


# Data transformation
- It is the final step in data preprocessing 
- Transforming the data into the appropriate form for data modeling 
- Strategies for data transformation 
    - Smoothing: removing noise from the dataset by using some algorithm
    - Attribute/feature construction: creating new attributes to assist the mining process 
    - Aggregation: storing and presenting data in a summary format
    - Normalization: converting a data variable into a given range
    - Discretization: transforming continuous data into a set of small intervals
    - Generalization: converting low-level data into high-level attributes, such as converting the ages 22, 20, 25 into a categorical value (young, old, ...)
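
Three of these strategies can be sketched in a few lines of pandas. The frame `people`, the min-max formula chosen for normalization and the cut-off age of 30 are all assumptions made for illustration.

```python
import pandas as pd

# Hypothetical ages
people = pd.DataFrame({"age": [20, 22, 25, 40, 60]})

# Normalization: rescale age to the range [0, 1] (min-max scaling)
age_min, age_max = people["age"].min(), people["age"].max()
people["age_scaled"] = (people["age"] - age_min) / (age_max - age_min)

# Generalization: map numeric ages to a coarse category
# (the cut-off of 30 is arbitrary)
people["age_group"] = people["age"].apply(lambda a: "young" if a < 30 else "old")

# Aggregation: summarise the data per category
mean_age = people.groupby("age_group")["age"].mean()

print(people)
print(mean_age)
```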

### Duplicate Values: 
- One of the important tasks in analyzing data is to identify and remove duplicate values. 
- pandas has the drop_duplicates() method for removing duplicate rows from a data frame. 
- Syntax
    - dataframe.drop_duplicates(parameters)
    - Parameters
        - subset: column label or sequence of labels, optional
            Restricts the duplicate check to the given columns.
            By default all of the columns are used. 
        - keep: {‘first’, ‘last’, False}, default ‘first’
            Determines which duplicates to keep 
                - first: drops all duplicates except for the first occurrence
                - last: drops all duplicates except for the last occurrence
                - False: drops all duplicates
        - inplace: bool, default False
            Whether to drop duplicates in place or to return a copy
        - If inplace=True, the data frame itself is modified and the method returns None; otherwise a new data frame with the duplicates removed is returned

In [None]:
# Examples Removing duplicate values 

df = pd.DataFrame ({
    'name': ['Johanna'] * 3 + ['Pekka'] * 2 + ['Andrew', 'Ben', 'Matthew'],
    'courseName': ['Data Processing'] * 8,
    'grade': [1,1,1,3,3,5,5,4]
})
df

Unnamed: 0,name,courseName,grade
0,Johanna,Data Processing,1
1,Johanna,Data Processing,1
2,Johanna,Data Processing,1
3,Pekka,Data Processing,3
4,Pekka,Data Processing,3
5,Andrew,Data Processing,5
6,Ben,Data Processing,5
7,Matthew,Data Processing,4


In [None]:
#Pandas duplicated() method returns the boolean series
df.duplicated()

0    False
1     True
2     True
3    False
4     True
5    False
6    False
7    False
dtype: bool

In [None]:
#by default drop_duplicates() method removes duplicate rows based on all columns
df.drop_duplicates()

Unnamed: 0,name,courseName,grade
0,Johanna,Data Processing,1
3,Pekka,Data Processing,3
5,Andrew,Data Processing,5
6,Ben,Data Processing,5
7,Matthew,Data Processing,4


In [None]:
#to remove duplicates on specific column(s)
df.drop_duplicates(subset=['courseName'])

Unnamed: 0,name,courseName,grade
0,Johanna,Data Processing,1


In [None]:
# to remove duplicates and keep the first occurrence
df.drop_duplicates(subset=['name','courseName'], keep='first')

In [None]:
# to remove duplicates and keep the last occurrence
df.drop_duplicates(subset=['name','courseName'], keep='last')

In [None]:
# drop every row whose name and courseName also appear in another row;
# since keep=False, all occurrences of the duplicated values are removed,
# and since inplace=True, df itself is modified (the call returns None)
df.drop_duplicates(subset=['name','courseName'], keep=False, inplace=True)
df
# check the output by recreating the data frame and changing to inplace=False

### Replacing values data.replace 
pandas.DataFrame.replace
- The replace method provides a simple way of replacing values in a dataframe. 
- Syntax: 
    df.replace(to_replace=None, value=None, inplace=False, limit=None, regex=False, method='pad')
- Parameters: to_replace: str, regex, list, dict, Series, int, float, or None

Read More: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.replace.html?highlight=data%20replace#pandas.DataFrame.replace
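
Since `to_replace` also accepts a dict, several replacements can be expressed in one call. A minimal sketch with made-up data (`readings` and `toy` are hypothetical names):

```python
import numpy as np
import pandas as pd

# Hypothetical sensor readings with two placeholder codes
readings = pd.Series([5, -9999, 20, -4000, 7])

# A dict maps each value to its own replacement in one call
cleaned = readings.replace({-9999: np.nan, -4000: 0})
print(cleaned)

toy = pd.DataFrame({"A": [0, 1, 2], "B": [0, 5, 6]})

# A nested dict restricts the replacement to one column:
# only the 0 in column "A" becomes 100, column "B" is untouched
only_a = toy.replace({"A": {0: 100}})
print(only_a)
```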

In [None]:
# replacing values - examples on a Series
df1 = pd.Series([0, 1, 2, 3, 4])
# to replace 0 with 4
df1.replace(0, 4)

df2 = pd.Series([5, -9999, 20, -9999., -4000, 7])
# in the above Series, let's replace the placeholder -9999 with NaN
df2.replace(-9999, np.nan, inplace=True)

# replacing multiple values at once with a list
df1.replace([0, 1, 2, 3, 4], 4)  # replaces all listed numbers with 4

In [None]:
# let's create a new dataframe with 3 columns A, B & C
df3 = pd.DataFrame({'A': [0, 1, 2, 3, 4],
                    'B': [5, 6, 7, 8, 9],
                    'C': ['a', 'b', 'c', 'd', 'e']})
# to replace 0 with 5 in df3
df3.replace(0, 5)

In [None]:
#replacing multiple instances "List-like"
df3.replace([0,1,2,3,4],4) #replaces all numbers with 4

In [None]:
# replaces 0 with 4, 1 with 3, 2 with 2, 3 with 1 and 4 with 0
df3.replace([0, 1, 2, 3, 4], [4, 3, 2, 1, 0])

### Discretization & Binning 

- When dealing with continuous numeric data, it is often helpful to separate the values into bins for further analysis. 
- pandas provides the cut & qcut functions for binning
- cut function
    - It is used for defining specific bin edges 
    - It can also define bins of constant size 
- qcut function
    - qcut is a “Quantile-based discretization function” 
    - It divides the underlying data into bins that hold roughly the same number of observations. 
    - Bins are defined by using percentiles based on the distribution of the data rather than the actual numeric edges of the bins. 
- Let's assume that we have an "age" column in our data set and group the values into discrete age buckets of 20 - 25, 26 - 35, 36 - 45, 46 - 60 & over 60, using the sample values
ages = [21, 23, 35, 36, 21, 23, 47, 35, 63, 43, 49, 33,47, 35, 63, 43, 49, 33]

In [None]:
# Example: binning with cut
ages = [21, 23, 35, 36, 21, 23, 47, 35, 63, 43, 49, 33,47, 35, 63, 43, 49, 33]
# to divide them into the bins 20 - 25, 26 - 35, 36 - 45, 46 - 60 & 61 - 100
# we will use the cut function
bins = [20,25,35,45,60, 100]
age_bins = pd.cut(ages,bins)
# to check the categories
age_bins.categories
# let's count the values for each bin
age_bins.value_counts()

# Note: a parenthesis means that side of the interval is open,
# a square bracket means it is closed (inclusive)
# it is possible to change which side is closed by passing right=False
age_bins1 = pd.cut(ages, bins, right=False)
age_bins1.categories

# It is possible to pass your own bin names
# by passing a list or array to the labels parameter
gnames = ['Young', 'YoungAdult', 'Adult','MiddleAged','Elderly']
bins_labels = pd.cut(ages, bins, labels=gnames)
bins_labels.categories

In [None]:
# Example: binning with qcut; qcut bins the data based on sample quantiles
import random
# let's generate 20 distinct random numbers between 0 and 99
df4 = random.sample(range(0, 100), 20)
# split them into 4 quantile-based bins (quartiles)
groups = pd.qcut(df4, 4)
groups.categories
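
The difference between the two functions is easiest to see on skewed data. In this minimal sketch the list `values` is made up, with one very large number: `cut` produces equal-width bins, while `qcut` produces bins with equal counts.

```python
import pandas as pd

# Hypothetical skewed data: nine small values and one large one
values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]

# cut with an integer splits the value range into equal-width bins
equal_width = pd.Series(pd.cut(values, 2)).value_counts(sort=False)

# qcut splits at quantiles, so each bin holds about the same number of values
equal_count = pd.Series(pd.qcut(values, 2)).value_counts(sort=False)

print(equal_width)
print(equal_count)
```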

### Detecting & Filtering Outliers

- Observations that deviate strongly from the other observations are outliers. 
- There is no precise way of identifying and defining outliers. Someone (a domain expert) has to interpret the raw observations and decide whether a value is an outlier or not. 
- Outliers may arise from mistakes during data collection. 
- They can also be an indication of genuine variance in your data. 
- If outliers are the result of a mistake, we can ignore them, but if they reflect variance in the data, they require further analysis. 

Example: in the table below, the salary of most CEOs is between 80 and 90 K; however, Juha has a very high salary. This might be a typing mistake, or it may show genuine variance, that is, Juha really has a very high salary among the CEOs.

|CEO|Salary Per Year| 
| :- | -: |
|Pekka| 80000|
|Juha| 180000|
|Vesa| 85000|
|Johanna| 90000|
|Sari| 87000|
|John| 85000|

#### Finding Outliers 

- There are several statistical methods for identifying outliers. 
- Two types of analysis for detecting an outlier
    - <b> Univariate </b>: Based on one variable.  "Univariate is a term commonly used in statistics to describe a type of data which consists of observations on only a single characteristic or attribute."
    - <b> Multivariate </b>: Based on two or more variables. 
- You can find outliers in multiple ways
    - Using Visualization Tools (box plot, scatter plot, histogram)
    - Z score (Mathematical function)
    - DBSCAN (Density Based Spatial Clustering of Applications with Noise)
    - Isolation Forest
    - Using IQR Score

### Combining datasets concat & append.
- We need to combine different data sources. 
- Combining datasets can be simple concatenation of two different data sets or more complicated such as database joins. 

In [None]:
# Simple Concatenation with pd.concat
# pd.concat() can be used for a simple concatenation of Series or DataFrame objects
df6 = pd.Series(['A', 'B', 'C'], index=[1, 2, 3])
df7 = pd.Series(['D', 'E', 'F'], index=[4, 5, 6])
pd.concat([df6, df7])

In [None]:
#pd.concat for concatenating dataframes
df8 = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]), columns=['a', 'b', 'c'], index=[0,1,2])
df9 = pd.DataFrame(np.array([[10, 20, 30], [40, 50, 60], [70, 80, 90]]), columns=['a', 'b', 'c'], index=[3,4,5])
pd.concat ([df8, df9])

In [None]:
#by default the concatenation takes place row-wise (axis=0); passing axis=1 concatenates column-wise instead

df10 = pd.DataFrame(np.array([[1, 2], [4, 5]]), columns=['a', 'b'], index=[0,1])
df11 = pd.DataFrame(np.array([[6, 7], [8, 9]]), columns=['c', 'd'], index=[0,1])
pd.concat([df10, df11], axis=1)

# Read More: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html 

### Indices 
- pd.concat preserves indices even if the result will have duplicate indices

In [None]:
df12 = pd.DataFrame(np.array([[1, 2], [4, 5]]), columns=['a', 'b'], index=[0,1])
df13 = pd.DataFrame(np.array([[6, 7], [8, 9]]), columns=['a', 'b'], index=[0,1])
pd.concat ([df12,df13])

### Avoiding duplicate indices
- In the output above, notice that the indices are repeated. This is valid within a data frame, but the outcome is often undesirable. We can handle it in the following ways. 
- Treat repeated indices as an error.
- To catch overlapping indices, it is possible to specify the verify_integrity flag; pd.concat then raises a ValueError.

In [None]:
try:
    pd.concat([df12, df13], verify_integrity=True)
except ValueError as e:
    print("ValueError:", e)

### Ignoring the index
- Sometimes the index does not matter. It is possible to ignore it with the ignore_index flag.
- Setting the ignore_index flag to True creates a new integer index for the resulting object.

In [None]:
pd.concat([df12, df13], ignore_index=True)

### Multiindex key
- It is possible to use keys to specify a label for the data. 

In [None]:
pd.concat([df12, df13], keys=['d1', 'd2'])
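
The keys become the outer level of a MultiIndex, which makes it possible to select rows by their source afterwards. A minimal sketch; `first` and `second` are hypothetical frames with the same shape as the ones above.

```python
import pandas as pd

first = pd.DataFrame({"a": [1, 4], "b": [2, 5]})
second = pd.DataFrame({"a": [6, 8], "b": [7, 9]})

# keys adds an outer index level that records where each row came from
combined = pd.concat([first, second], keys=["d1", "d2"])
print(combined)

# Select every row that came from the second frame
from_second = combined.loc["d2"]
print(from_second)

# Select a single row by (key, original index)
one_row = combined.loc[("d1", 1)]
print(one_row)
```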

### Concatenation with joins
- There will be situations where we need to join data from different sources that have different sets of column names in addition to shared column names.
- pd.concat has several options for these cases.

In [None]:
df = pd.DataFrame ({
    'name':  ['Andrew', 'Ben', 'Matthew', 'Johanna', 'Pekka'],
    'Data Processing': [1,5,5,4,3]
})
df1 = pd.DataFrame ({
    'name':  ['Andrew', 'Ben', 'Matthew', 'Johanna', 'Pekka'],
    'Advanced Data Processing': [5,4,3,2,1],
    'Databases':[5,4,3,2,1],
    'Maths':[5,4,3,2,1]
})
pd.concat([df, df1]) # by default the entries for which no data is available are filled with NaN values.
# we can choose how columns that are not shared are handled with the join parameter
# By default, the join is a union of the input columns (join='outer');
# we can change this to an intersection of the columns using join='inner':

In [None]:
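# now let's make the inner join; with axis=1 the frames are placed side by side
# and join='inner' keeps only the row labels that are present in both frames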
pd.concat([df, df1], axis=1, join='inner')
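
For a database-style join on a shared column such as `name`, `pd.merge` is often a closer fit than `pd.concat`. The sketch below uses two hypothetical frames, `course_a` and `course_b`, whose student lists only partly overlap.

```python
import pandas as pd

course_a = pd.DataFrame({
    "name": ["Andrew", "Ben", "Matthew"],
    "Data Processing": [1, 5, 5],
})
course_b = pd.DataFrame({
    "name": ["Ben", "Matthew", "Pekka"],
    "Databases": [4, 3, 1],
})

# Inner join: only names present in both frames
inner = pd.merge(course_a, course_b, on="name", how="inner")

# Outer join: all names, with NaN where a grade is missing
outer = pd.merge(course_a, course_b, on="name", how="outer")

print(inner)
print(outer)
```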

### The append() method
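- "Append rows of other to the end of caller, returning a new object."
- Note: DataFrame.append() was deprecated in pandas 1.4 and removed in pandas 2.0. In current versions use pd.concat() instead; the cells below show the equivalent calls.

In [None]:
#example 1: equivalent of df.append(df1), written with pd.concat
pd.concat([df, df1])

In [None]:
#ignore_index: equivalent of df.append(df1, ignore_index=True)
pd.concat([df, df1], ignore_index=True)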