# Data prep 

Data prep is about getting the dataset ready for analysis. It involves ETL (Extract, Transform, Load) and cleaning the dataset: dealing with missing values, duplicate values, categorical variables, and outliers, transforming variables, and adding new variables.
Merging datasets is also part of data prep but is covered in a later notebook, as are lagged and forward values.

These are the basic data prep tasks:

- Info on datasets (describe and info methods)
- Dealing with missing data
- Turning categorical (factor) variables into dummy/indicator variables
- Detecting outliers 
- Normalizing, standardizing
- Applying functions (on rows or columns)
- Binning (separate notebook)


## Info on datasets

### Sample dataset

In [None]:
import pandas as pd
import numpy as np

# read sample dataset
data = pd.read_csv('../datasets/feedback.csv')
data.head()

### Info function

The info function displays a summary of the dataframe. It displays column names, data types, the number of non-null values, and memory usage.<br>

__Syntax:__ DataFrame.info(verbose=None, buf=None, max_cols=None, memory_usage=None, show_counts=None)<br>

Where

- verbose : Determines whether the full summary is to be printed. Takes a bool value.
- buf : Determines where to send the output.
- max_cols : Tells when to switch from verbose to truncated output. Takes an int; if the dataframe has more than max_cols columns, the truncated output is used.
- memory_usage : Determines whether total memory usage of the dataframe elements should be displayed.
- show_counts : Determines whether to show non-null value counts or not. A value of True will always show counts, while False never shows counts.

__Return value:__ The info function does not return anything. Instead, it outputs a concise summary of the dataframe.
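
Because info() prints its summary and returns nothing, the text has to be captured through the buf parameter if you want to keep it. A minimal sketch, using a small made-up dataframe (`toy` is hypothetical, not the feedback dataset):

```python
import io

import pandas as pd

# small hypothetical dataframe, only for illustration
toy = pd.DataFrame({"name": ["a", None, "c"], "price": [10.0, 20.0, None]})

# info() prints the summary and returns None
result = toy.info()

# to keep the summary as text, send the output to a buffer instead
buffer = io.StringIO()
toy.info(buf=buffer)
summary_text = buffer.getvalue()
```

Here `result` is None, while `summary_text` holds the same summary that was printed.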

In [None]:
# with verbose=True
# info() prints the summary itself and returns None, so no print() is needed
data.info(verbose=True)

### Column data type
  
- dtypes: get the data types
- astype(): set the data type ("float", "int", or "object", which pandas commonly uses for string columns)

In [None]:
# get data types
data.dtypes

In [None]:
# change data type for price to float
data['price'] = data['price'].astype('float')
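
astype() raises an error when a value cannot be converted (for example the text "n/a" in a price column). One possible way around this is pd.to_numeric with errors="coerce", which turns unparseable values into NaN. The sketch below uses a made-up column, not the feedback dataset:

```python
import pandas as pd

# hypothetical price column stored as text, with one unparseable entry
toy = pd.DataFrame({"price": ["100", "85.5", "n/a"]})

# errors="coerce" converts what it can and sets the rest to NaN
toy["price_num"] = pd.to_numeric(toy["price"], errors="coerce")

# number of values that could not be converted
n_failed = toy["price_num"].isna().sum()
```

The resulting NaN values can then be handled like any other missing data.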

### Describe function

The describe() method returns a statistical summary of the data in the DataFrame.<br>

__Syntax:__ DataFrame.describe(percentiles=None, include=None, exclude=None)<br>
Here,
- percentiles: list of numbers between 0 and 1 giving the percentiles to report (by default the 25th, 50th, and 75th)
- include: list of data types to include when describing the dataframe, Default = None (numeric columns only)
- exclude: list of data types to exclude when describing the dataframe, Default = None

__Return:__ Statistical summary of data frame

In [None]:
# describe function on a single numerical variable
data['number_of_reviews'].describe()

In [None]:
# describe function on a single string variable
data['name'].describe()

In [None]:
# count how often each distinct value occurs
data['name'].value_counts()

In [None]:
# percentile list
perc =[.20, .40, .60, .80]
# list of dtypes to include
incl =['object', 'float', 'int']
# calling describe method
myInfo = data.describe(percentiles = perc, include = incl)
print(type(myInfo))

In [None]:
# display
myInfo

## Dealing with missing data

### Dropna function

The dropna function drops every observation (row) that has at least one missing value. This is quite aggressive; sometimes you may prefer to replace missing values with zero, the average, etc.
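
dropna also has less aggressive settings. The sketch below, on a small made-up dataframe (`toy` and its columns are hypothetical), compares the default with the how and subset parameters:

```python
import numpy as np
import pandas as pd

# hypothetical data: row 0 is complete, rows 1 and 2 are partly missing, row 3 is empty
toy = pd.DataFrame({
    "name": ["a", None, "c", None],
    "price": [10.0, 20.0, np.nan, np.nan],
    "reviews": [1, 2, 3, np.nan],
})

any_missing = toy.dropna()                   # drop rows with at least one missing value
all_missing = toy.dropna(how="all")          # drop only rows where every value is missing
price_known = toy.dropna(subset=["price"])   # only look at the price column

print(len(toy), len(any_missing), len(all_missing), len(price_known))
```

None of these calls change `toy` itself; each returns a new dataframe.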

In [None]:
# read sample dataset
data = pd.read_csv('../datasets/feedback.csv')
print('number of rows in original data:', data.shape[0])
# remove rows with missing values
# inplace=True changes the dataframe (data) itself; otherwise assign the result to a new dataframe: data2 = data.dropna()
data.dropna(inplace=True)
print('number of observations after dropping missing values:', data.shape[0])

In [None]:
# get rid of duplicates
print('number of observations before dropping duplicates:', data.shape[0])
# drop_duplicates() returns a new dataframe, so assign the result back
data = data.drop_duplicates()
print('number of observations after dropping duplicates:', data.shape[0])

### isnull, notnull functions

The isnull() method returns a DataFrame of the same shape in which every value is replaced by a Boolean: True where the value is missing (NULL), and False otherwise.<br>
Thus, it detects missing values for an array-like object.

The notnull() method does the opposite: True if the value is not NULL, False otherwise.

In [None]:
# re-read the data so we have all observations again (including the missing values)
data = pd.read_csv('../datasets/feedback.csv')
# locate missing data, note that the 'False' means that the value is not NULL
data.isnull()

In [None]:
# summing booleans counts the True values (True counts as 1, False as 0)
sum([True, False, False, True])

In [None]:
# count the missing values in each column
data.isnull().sum()

In [None]:
# let's get the 16 observations with missing/NULL name
# filter: data['name'].isnull()
# results in a Series of booleans (True, False); only the rows where it is True end up in the filtered dataframe
filtered_df = data[ data['name'].isnull() ]
print('#rows with missing name:', filtered_df.shape[0])
filtered_df.head()

In [None]:
# replace missing name values
data['name'] = data['name'].fillna('Unknown name')

In [None]:
# what if price was missing, and you want to replace it with the sample-wide average?
price_avg = data["price"].mean()
print("Average price:", price_avg)
# replace missing prices (NaN) with price_avg and assign the result back to the column
data["price"] = data["price"].fillna(price_avg)

## Turning categorical (factor) variables into dummies

In the sample dataset there are three room types. Let's turn that into three dummies.

In [None]:
# Count the distinct values
data['room_type'].value_counts()

In [None]:
data['room_type'].head()

In [None]:
# create dummy variables (new dataset)
dummies = pd.get_dummies(data["room_type"])
dummies.head()

In [None]:
# add the dummy variables to the dataset (but drop the first)
# the reason to drop one is to have a reference (baseline) group
# (if a regression has an intercept then only 2 of the 3 dummies can be included in the regression)
data = pd.get_dummies(data, columns=['room_type'], drop_first=True)
data.head()

In [None]:
# rename variables 
data.rename(columns={'room_type_Private room':'private', 'room_type_Shared room':'shared'}, inplace=True)
data.head()
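
Recent pandas versions return True/False columns from get_dummies by default. If 0/1 columns are preferred (for example for a regression), the dtype parameter can be set. A small sketch on made-up data; the room type labels here are assumptions, not taken from the feedback dataset:

```python
import pandas as pd

# hypothetical room types, only for illustration
toy = pd.DataFrame({
    "room_type": ["Entire home/apt", "Private room", "Shared room", "Private room"]
})

# dtype=int gives 0/1 columns; drop_first=True drops the first category
# (alphabetically 'Entire home/apt'), which becomes the reference group
toy_dummies = pd.get_dummies(toy, columns=["room_type"], drop_first=True, dtype=int)
print(toy_dummies)
```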

## Data Normalization

Normalization is the process of transforming the values of several variables into a similar range. Typical normalizations include scaling the variable so the variable average is 0, the variance is 1, or the variable values range from 0 to 1.<br>
Approach used here: replace each original value by (original value)/(maximum value), so that the largest value becomes 1.<br>

In [None]:
# let's scale price such that it is between 0 and 1
data['price_scaled'] = data['price']/data['price'].max()
data.head()
# note: if price also had negative values, then divide by the max of the absolute value (values then range from -1 to 1)
# data['price_scaled'] = data['price']/data['price'].abs().max()
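
Dividing by the maximum puts the largest value at 1, but the smallest value only reaches 0 if the data contain a 0. A common alternative is min-max scaling, which maps the minimum to 0 and the maximum to 1. A minimal sketch on made-up prices (not the feedback dataset):

```python
import pandas as pd

# hypothetical prices, only for illustration
prices = pd.Series([50.0, 100.0, 150.0, 250.0])

# approach from this notebook: divide by the maximum
scaled_by_max = prices / prices.max()

# min-max scaling: (value - min) / (max - min)
min_max = (prices - prices.min()) / (prices.max() - prices.min())

print(pd.DataFrame({"price": prices, "scaled_by_max": scaled_by_max, "min_max": min_max}))
```

Min-max scaling is undefined when all values are equal, because the denominator is then zero.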

## Standardizing (mean 0, standard deviation 1)

Another approach is to standardize a variable such that it has a mean of 0 and a standard deviation of 1.

So, we subtract the mean of the variable and divide by the standard deviation.
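
In formula form this is z = (x - mean) / std. A minimal sketch on made-up prices (in the notebook the same expression could be applied to a dataframe column):

```python
import pandas as pd

# hypothetical prices, only for illustration
prices = pd.Series([50.0, 100.0, 150.0, 250.0])

# subtract the mean and divide by the standard deviation
# (pandas' std() uses the sample standard deviation, ddof=1, by default)
standardized = (prices - prices.mean()) / prices.std()

print("mean after standardizing:", standardized.mean())
print("std after standardizing:", standardized.std())
```

Up to floating-point rounding, the printed mean is 0 and the standard deviation is 1.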

## Applying functions (on rows or columns)

### Apply function

The apply function applies a function to each column or each row of a DataFrame.

__Syntax__ : DataFrame.apply(func, axis=0, raw=False, result_type=None, args=(), **kwargs)<br>

Here,<br>
- func : Function to apply to each column or row.<br>
- axis : {0 or ‘index’, 1 or ‘columns’}, default 0. Axis along which the function is applied:<br>
0 or ‘index’: apply function to each column.<br>
1 or ‘columns’: apply function to each row.<br>
- raw : bool, default False. Determines if row or column is passed as a Series or ndarray object:<br>
False : passes each row or column as a Series to the function.<br>
True : the passed function will receive ndarray objects instead.<br>
- result_type : {‘expand’, ‘reduce’, ‘broadcast’, None}, default None. These only act when axis=1 (columns):<br>
‘expand’ : list-like results will be turned into columns.<br>
‘reduce’ : returns a Series if possible rather than expanding list-like results. This is the opposite of ‘expand’.<br>
‘broadcast’ : results will be broadcast to the original shape of the DataFrame, the original index and columns will be retained.<br>
- args : tuple. Positional arguments to pass to func in addition to the array/series.<br>
- **kwargs : Additional keyword arguments to pass to func.<br>

__Returns__ : Series or DataFrame (Result of applying func along the given axis of the DataFrame)<br>
    
See: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.apply.html
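
The axis and result_type parameters are easiest to see with custom functions. In the sketch below (same toy dataframe as in the next cells; the lambda functions are only illustrations), axis=0 passes each column to the function and axis=1 passes each row:

```python
import pandas as pd

df = pd.DataFrame([[4, 9]] * 3, columns=["A", "B"])

# axis=0: the function receives one column at a time -> one result per column
col_range = df.apply(lambda col: col.max() - col.min(), axis=0)

# axis=1: the function receives one row at a time -> one result per row
row_total = df.apply(lambda row: row["A"] + row["B"], axis=1)

# result_type='expand': a list returned for each row is spread over new columns
expanded = df.apply(lambda row: [row["A"] * 2, row["B"] * 2],
                    axis=1, result_type="expand")

print(col_range)
print(row_total)
print(expanded)
```

For simple arithmetic like this, vectorized expressions such as `df["A"] + df["B"]` are usually faster than apply; apply is mainly useful when the logic does not vectorize easily.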
    

In [None]:
# example
df = pd.DataFrame([[4, 9]] * 3, columns=['A', 'B'])
df

In [None]:
# np.sqrt works element-wise, so this takes the square root of every cell
df.apply(np.sqrt)

In [None]:
# axis: 0 (default) -- sums each column (one total per column)
df.apply(np.sum, axis=0)

In [None]:
# axis: 1 -- sums each row (one total per row)
df.apply(np.sum, axis=1)

### Map function

We can also use the map() function to apply a function to each element of a column (a Series). <br>

Let's assume we have a dataset that has gender as a string ("M" or "F"). We can reshape the data by turning this into an indicator (dummy) variable, say 1 for "F" and 0 otherwise.

We can write a function and use apply, or use the map() function. Python's built-in map() applies a function to each element of a list and returns a 'map object', which can be turned back into a list. The pandas Series.map() method works similarly on a column, and also accepts a dictionary that maps old values to new values.

In [None]:
# map example 
my_list = [2.6743,3.63526,4.2325,5.9687967,6.3265,7.6988,8.232,9.6907]
# apply round function to the list
updated_list = map(round, my_list)
# this will print a map object
print(updated_list)
# but this can be turned back into a list
print(list(updated_list))

In [None]:
# Assign data (np.nan marks a missing value)
data = {'Name': ['Jax', 'Prince', 'Gaunther',
                 'Emanuel', 'Ron', 'Natasha', 'Lexi'],
        'Age': [17, 17, 18, 17, 18, 17, 17],
        'Gender': ['M', 'M', 'M', 'M', 'M', 'F', 'F'],
        'Marks': [90, 76, np.nan, 74, 65, np.nan, 71]}
 
# Convert into DataFrame
df = pd.DataFrame(data)
df

In [None]:
# Turn gender into an indicator variable: 1 for 'F', 0 for 'M'
df['Gender'] = df['Gender'].map({'M': 0, 'F': 1}).astype(int)
df
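
One thing to watch with a dictionary in Series.map(): values that are not in the dictionary become NaN, and a following astype(int) would then fail. A minimal sketch with a made-up Series that contains an unexpected value ("X" is hypothetical):

```python
import pandas as pd

# hypothetical gender column with one value that is not 'M' or 'F'
genders = pd.Series(["M", "F", "F", "X"])

# dictionary mapping: 'X' is not a key, so it becomes NaN
by_dict = genders.map({"M": 0, "F": 1})

# option 1: fill the unmapped values before converting to int
by_dict_filled = by_dict.fillna(0).astype(int)

# option 2: map with a function that implements "1 for 'F', 0 otherwise"
by_func = genders.map(lambda g: 1 if g == "F" else 0)

print(by_dict.isna().sum(), "unmapped value(s)")
```

Both options give the same 0/1 indicator here; which one fits depends on whether unexpected values should be treated as 0 or inspected first.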