# Data Preprocessing Steps:


## Import the libraries

**numpy** will allow us to work with arrays, which some machine learning models expect as input.

**matplotlib** will allow us to plot charts and graphs.

**pandas** will allow us to import the datasets, as well as create the matrix of features and the dependent variable vector.

**sklearn.preprocessing** will allow us to scale features.

**sklearn.model_selection.train_test_split** will allow us to randomly split our data into a set for training and a set for testing.

**sklearn.impute** provides tools to allow us to deal with missing values.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer

## Import and examine the dataset

[pima-indians-diabetes.csv](https://drive.google.com/file/d/1lWDk46jRhhFg8xY6Ga8bH6-6sWOpfGpM/view?usp=sharing)

In [None]:
df = pd.read_csv("data/pima-indians-diabetes.csv")

In [None]:
df.head()

In [None]:
df

In [None]:
df.info()

In [None]:
df.describe()

## Taking Care of missing data

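The `df.describe()` output shows a minimum of zero for the following columns. A zero is not a valid measurement for any of them, so these zeros are treated as missing values: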

1. Plasma glucose concentration
2. Diastolic blood pressure
3. Triceps skinfold thickness
4. 2-Hour serum insulin
5. Body mass index

In [None]:
# a plain list of column labels is enough to select columns
cols_with_missing = ['plasma_glucose', 'bp', 'skin_fold', '2_hr_insulin', 'bmi']

In [None]:
# count the number of '0' values for each column
num_missing = (df[cols_with_missing] == 0).sum()
# report the results
print(num_missing)

In [None]:
# replace '0' values with 'nan'
df[cols_with_missing] = df[cols_with_missing].replace(0, np.nan)

# print the first 20 rows of data
print(df.head(20))

In [None]:
# count the number of nan values in each column
print(df.isnull().sum())

### Alternative - eliminate any rows that contain NaN

In [None]:
# summarize the shape of the raw data
print(df.shape)
# drop rows with missing values
df1 = df.dropna()
# summarize the shape of the data with missing rows removed
print(df1.shape)


### Alternative - impute missing values

Common choices for the replacement value include:

- A constant value that has meaning within the domain, such as 0, distinct from all other values.
- A value from another randomly selected record.
- A mean, median or mode value for the column.
- A value estimated by another predictive model.
- For time series data, the average of the previous and the succeeding values.
- For categorical data, a new category. For instance, 'Missing Gender' for a 'Sex' column.
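
A few of these options can be sketched with pandas on a small made-up frame. The column names `bmi`, `sex`, and `reading` below are illustrative assumptions and are not taken from the datasets in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; not part of the Pima or penguin datasets
toy = pd.DataFrame({
    "bmi": [22.0, np.nan, 31.0, 27.0],
    "sex": ["F", None, "M", "F"],
    "reading": [1.0, np.nan, 3.0, 4.0],  # treated as an ordered time series
})

filled = toy.copy()
# Median of the observed values in the column
filled["bmi"] = toy["bmi"].fillna(toy["bmi"].median())
# New category for a missing categorical value
filled["sex"] = toy["sex"].fillna("Missing")
# Average of the previous and the succeeding values (linear interpolation)
filled["reading"] = toy["reading"].interpolate(method="linear")

print(filled)
```

Linear interpolation equals the average of the two neighbours only when a single value is missing between them; longer gaps are filled along a straight line.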


In [None]:
# fill missing values with mean column values
# this can also be done in place if desired
df1 = df.fillna(df.mean())
# count the number of NaN values in each column
print(df1.isnull().sum())

#### The scikit-learn library provides the SimpleImputer pre-processing class that can be used to replace missing values. 

**Imputation strategies:**<br/>
- **mean:** replace missing values using the mean along each column. Can only be used with numeric data.
- **median:** replace missing values using the median along each column. Can only be used with numeric data.
- **most_frequent:** replace missing values using the most frequent value along each column. Can be used with strings or numeric data. If there is more than one such value, only the smallest is returned.
- **constant:** replace missing values with fill_value. Can be used with strings or numeric data.
- **instance of Callable:** replace missing values using the scalar statistic returned by running the callable over a dense 1d array containing non-missing values of each column.
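
As a rough illustration of how the strategies differ, the sketch below fits a `SimpleImputer` per strategy on a small made-up array and prints the fill value learned for each column (the fitted `statistics_` attribute). The array `data` and the fill value `-1.0` are assumptions chosen for the example.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical 2-column array with one missing value per column
data = np.array([[1.0, 10.0],
                 [np.nan, 10.0],
                 [3.0, np.nan],
                 [8.0, 40.0]])

# Fit one imputer per strategy and keep the per-column fill values
stats = {}
for strategy in ("mean", "median", "most_frequent"):
    imputer = SimpleImputer(missing_values=np.nan, strategy=strategy)
    stats[strategy] = imputer.fit(data).statistics_
    print(strategy, stats[strategy])

# 'constant' needs an explicit fill_value
constant = SimpleImputer(strategy="constant", fill_value=-1.0)
constant_filled = constant.fit_transform(data)
print(constant_filled)
```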

In [None]:
# retrieve the numpy array
values = df.values

In [None]:
# define the imputer
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
# transform the dataset
transformed_values = imputer.fit_transform(values)
# count the total number of NaN values remaining in the array
print('Missing: %d' % np.isnan(transformed_values).sum())

## Encoding Categorical data

[penguins2.csv](https://drive.google.com/file/d/16AOCJJar6igQC8v9KJeotX7wN3OuHMRx/view?usp=sharing)

In [None]:
df2 = pd.read_csv('data/penguins2.csv')

In [None]:
df2

In [None]:
df2.info()

In [None]:
df2.dropna(inplace=True)
df2.info()

In [None]:
df2.head()

In [None]:
numeric_columns = ['bill_length_mm','bill_depth_mm','flipper_length_mm',
                     'body_mass_g']

## Encoding the Independent Variable

#### Two categorical independent variables

In [None]:
df2['island'].unique()

In [None]:
df2['sex'].unique()

In [None]:
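# dtype=int produces 0/1 columns; recent pandas versions default to booleans,
# which would turn the mixed feature matrix into an object array later on
df2 = pd.get_dummies(df2, columns=["island","sex"], dtype=int)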

In [None]:
df2
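
To make the effect of one-hot encoding easier to see, here is a minimal sketch on a made-up frame with the same two categorical columns. The rows are invented for illustration and do not come from `penguins2.csv`.

```python
import pandas as pd

# Hypothetical toy frame mirroring the two categorical columns above
toy = pd.DataFrame({
    "island": ["Biscoe", "Dream", "Torgersen", "Dream"],
    "sex": ["male", "female", "female", "male"],
    "body_mass_g": [3750, 3800, 3250, 4100],
})

# Each category becomes its own 0/1 column named <column>_<category>
encoded = pd.get_dummies(toy, columns=["island", "sex"], dtype=int)
print(encoded.columns.tolist())
print(encoded)
```

The original `island` and `sex` columns are removed, and the untouched numeric column stays in front of the new indicator columns.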

In [None]:
categorical_columns = ['island_Biscoe','island_Dream',
                       'island_Torgersen','sex_female','sex_male']
ind_columns = numeric_columns + categorical_columns

In [None]:
ind_columns

## Encoding the Dependent Variable

In [None]:
df2['species'].unique()

In [None]:
df2["label"] = np.where(df2["species"].str.contains("Adelie"), 1, 0)
df2

## Feature Scaling

**Feature Scaling** is important for distance-based machine learning algorithms such as kNN (k-Nearest Neighbors) and SVM (Support Vector Machine), because features with large numeric ranges would otherwise dominate the distance calculation.

**Standardization** rescales each feature so that it has a mean of 0 and a standard deviation of 1:

$$ x_{standardized} = \frac{x_{original} - \mu_x}{\sigma_x}  $$

**$ \mu_x $** is the mean and **$ \sigma_x $** is the standard deviation of feature $x$. The standardized value is called a z-score. Since each unit represents one standard deviation, most z-scores fall between -2 and 2 (about 95% of them when the feature is approximately normally distributed).
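
The formula can be checked by hand on a few values. The sketch below uses made-up body-mass numbers and compares the manual z-scores with `preprocessing.scale`, which uses the population standard deviation (the NumPy default).

```python
import numpy as np
from sklearn import preprocessing

# Hypothetical body-mass values in grams
x = np.array([3000.0, 3500.0, 4000.0, 4500.0, 5000.0])

# Apply the formula directly; np.std defaults to the population standard deviation
z_manual = (x - x.mean()) / x.std()

# preprocessing.scale applies the same formula column by column
z_sklearn = preprocessing.scale(x.reshape(-1, 1)).ravel()

print(z_manual)
print(np.allclose(z_manual, z_sklearn))
```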


In [None]:
original = df2[numeric_columns]
# Standardize dataframe and return as an array
standardizedArray = preprocessing.scale(original)

# Convert standardized array to dataframe 'standardized'
standardized = pd.DataFrame(standardizedArray, columns=numeric_columns)

In [None]:
standardized

**Normalization** converts features to the range [0,1]:

$$ x_{normalized} = \frac{x_{original} - Min_x}{Max_x - Min_x} $$

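- Normalization is often used when a feature does not have a Gaussian distribution.
- Note that normalization maps every value, including outliers, into [0,1], so a single extreme value compresses the remaining values into a narrow band. Standardized values are unbounded, so outliers remain far from 0 as large positive or negative z-scores.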
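
A small made-up example shows both the formula and the outlier effect. The values in `x` are assumptions; the single large value 100.0 plays the role of an outlier.

```python
import numpy as np
from sklearn import preprocessing

# Hypothetical values with one large outlier (100.0)
x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

# Apply the min-max formula directly
x_manual = (x - x.min()) / (x.max() - x.min())

# MinMaxScaler expects a 2-D array: one column per feature
x_scaled = preprocessing.MinMaxScaler().fit_transform(x.reshape(-1, 1)).ravel()

print(x_manual)
print(np.allclose(x_manual, x_scaled))
```

The outlier maps to 1 and the four ordinary values are squeezed into a small interval just above 0.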


In [None]:
# Normalize dataframe and return as an array
normalizedArray = preprocessing.MinMaxScaler().fit_transform(df2[numeric_columns])

# Convert normalized array to dataframe 'normalized'
normalized = pd.DataFrame(normalizedArray, columns=numeric_columns)
normalized

***Python data structuring methods.***

| Method	| Parameters	| Description |
| :-------- | :------------ | :---------- |
| string[start:end]	| none	| Returns the substring of string that begins at the index start and ends at the index end - 1. |
|string.capitalize()<br>string.upper()<br>string.lower()<br>string.title()	| none	| Returns a copy of string with the initial character uppercase, all characters uppercase, all characters lowercase, or the initial character of all words uppercase. |
| to_datetime()	| arg	| Converts arg to datetime data type and returns the converted object. Data type of arg may be int, float, str, datetime, list, tuple, one-dimensional array, Series, or DataFrame. |
| to_numeric()	| arg	| Converts arg to numeric data type and returns the converted object. Data type of arg may be scalar, list, tuple, one-dimensional array, or Series.|
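
The entries in this table can be tried out as follows. This is a short sketch with made-up inputs; `to_datetime()` and `to_numeric()` are called through the pandas namespace as `pd.to_datetime()` and `pd.to_numeric()`.

```python
import pandas as pd

name = "adelie penguin"
print(name[0:6])          # characters at indexes 0 through 5
print(name.capitalize())  # only the initial character uppercase
print(name.title())       # initial character of every word uppercase

# pd.to_datetime and pd.to_numeric convert list-like input
dates = pd.to_datetime(["2021-01-15", "2021-02-20"])
numbers = pd.to_numeric(pd.Series(["1", "2.5", "3"]))
print(dates)
print(numbers)
```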


**pandas data structuring methods.**

| Method	| Parameters	| Description |
| :-------- | :------------ | :---------- |
| df.astype()	| dtype<br>copy=True	| Converts the data type of all dataframe df columns to dtype. To alter individual columns, specify dtype as {col: dtype, col:dtype, . . .}. |
| df.insert()	| loc<br>column<br>value	| Inserts a new column with label column at location loc in dataframe df. value is a Scalar, Series, or Array of values for the new column. |
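
A minimal sketch of both methods on a made-up frame follows. The column names and values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical frame whose id column was read in as text
toy = pd.DataFrame({"id": ["1", "2", "3"], "mass": [3750, 3800, 3250]})

# Convert individual columns by passing a {column: dtype} mapping
toy = toy.astype({"id": "int64", "mass": "float64"})

# Insert a new column at position 1; insert() modifies toy in place
toy.insert(1, "species", "Adelie")

print(toy.dtypes)
print(toy)
```

Note that `astype()` returns a new dataframe, while `insert()` changes the existing one and returns nothing.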

***Python data enriching methods.***

| Method	| Parameters	| Description |
| :-------- | :------------ | :---------- |
| concat()	|objs<br>axis=0<br>join='outer'<br>ignore_index=False	| Appends dataframes specified in objs parameter. Appends rows if **axis=0** or columns if **axis=1**. join specifies whether to perform an 'outer' or 'inner' join. Resulting index values are unchanged if **ignore_index=False** or renumbered if **ignore_index=True**. |
| df.apply()	| func<br>axis=0<br>	| Applies the function specified in func parameter to a dataframe df. Applies function to each column if **axis=0** or to each row if **axis=1**. Returns a Series or DataFrame. |
| df.insert()	| loc<br>column<br>value	| Inserts a column to df. **loc** specifies the integer position of the new column. **column** specifies a string or numeric column label. **value** specifies column values as a Scalar or Series. |
| df.merge()	| right<br>how='inner'<br>on=None<br>sort=False	|Joins df with the right dataframe. **how** specifies whether to perform a **'left'**, **'right'**, **'outer'**, or **'inner'** join.  **on** specifies join column labels, which must appear in both dataframes. If **on=None**, all matching labels become join columns. **sort=True** sorts rows on the join columns.|
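
The following sketch combines `concat()`, `df.apply()`, and `df.merge()` on small made-up frames. The frames `a`, `b`, and `islands` are hypothetical.

```python
import pandas as pd

# Hypothetical frames sharing an 'id' column
a = pd.DataFrame({"id": [1, 2], "mass": [3750, 3800]})
b = pd.DataFrame({"id": [3, 4], "mass": [3250, 4100]})
islands = pd.DataFrame({"id": [1, 2, 3], "island": ["Biscoe", "Dream", "Dream"]})

# Append rows and renumber the index
stacked = pd.concat([a, b], axis=0, ignore_index=True)

# Apply a function to each column (axis=0)
col_max = stacked.apply(lambda col: col.max(), axis=0)

# Inner join keeps only the ids present in both frames
merged = stacked.merge(islands, how="inner", on="id")

print(stacked)
print(col_max)
print(merged)
```

The row with `id` 4 has no match in `islands`, so the inner join drops it; `how='left'` would keep it with a missing `island` value.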


## Split the dataset into training and testing sets

In [None]:
df2.info()

In [None]:
# Store relevant columns as variables
X = df2[ind_columns].values
# create a 1-D numpy array
y = df2[['label']].values.ravel()

In [None]:
X.shape

In [None]:
y.shape

In [None]:
type(df2[ind_columns].values)

In [None]:
trainX,testX,trainY,testY = train_test_split(X, y, test_size=.2, random_state=42)

print('Split X: ',trainX.shape, testX.shape)
print('Split Y: ',trainY.shape, testY.shape)
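
The roles of `test_size` and `random_state` can be seen on a small made-up array. In this sketch, `X_toy` and `y_toy` are hypothetical stand-ins for the feature matrix and label vector.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 samples, 2 features, binary labels
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0, 1] * 5)

first = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)
second = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

# test_size=0.2 holds out 2 of the 10 rows
print(first[0].shape, first[1].shape)
# The same random_state gives the same partition on every run
print(all(np.array_equal(p, q) for p, q in zip(first, second)))
```

Fixing `random_state` keeps the split reproducible, which makes results comparable between runs.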

**Leading public datasets.**

| Name	| Link	| Description |
| :---- | :---- | :-----------|
| Kaggle	| kaggle.com	| Over 50,000 datasets on a broad range of subjects. Also provides Jupyter notebooks that analyze the datasets. |
| FiveThirtyEight	| data.fivethirtyeight.com	| Datasets on politics, sports, science, economics, health, and culture, initially developed to support FiveThirtyEight publications. |
| University of California Irvine Machine Learning Repository |	archive.ics.uci.edu	| 622 datasets, primarily in science, engineering, and business. |
| Data.gov	| data.gov	| U.S. government datasets on agriculture, climate, energy, maritime, oceans, and health. |
| World Bank Open Data	| data.worldbank.org	| Global datasets on subjects such as health, education, agriculture, and economics. |
| Nasdaq Data Link	| data.nasdaq.com	| Financial and economic datasets. |
| NYC Open Data	| opendata.cityofnewyork.us	| NYC government services datasets. |

