# Machine Learning

In [19]:
# The packages for this file:
import pandas as pd
import numpy as np
import math
import sklearn
from sklearn.preprocessing import StandardScaler

# template:
temp = pd.DataFrame({"Race": ["White", "White", "Asian", "Black", "Declined", "Asian", "Black"], 
                    "Time": ["11/04/2023", "12/04/2023", "13/04/2023", "14/04/2023", "15/04/2023", "16/04/2023",
                             "17/04/2023"], 
                    "hoursStayICU": [10, 20, 30, 40, 50, 60, 70]})

## 1. Types

1. **Supervised Learning**: Given some data and an outcome, use the data to predict the outcome.  
2. **Unsupervised Learning**: Use the data to learn an underlying latent structure. This structure can then be used to generate more data, or used downstream in other types of learning. Note: there is no outcome to predict in this case. 
3. **Reinforcement Learning**: Given a reward signal, learn which actions to take to maximize that signal.

## 2. What matters

1. **Data**: "Garbage in, garbage out."  
2. **Models**: Which kind of model you use. This will be the focus, but first make sure the data is of good quality.  
3. **Criterion**: How do we decide that a model is good enough?

## 3. Data Cleaning

We start with data cleaning. You need to understand the data in order to select a machine learning model correctly. The first step to understanding a dataset is importing it and then probing it with questions. This is **Exploratory Data Analysis** (EDA). When importing data, many things can go wrong immediately, which is why data cleaning is needed. The problems you may face:  
1. The data is malformed. 
2. There are missing values.
3. The data has different scales.
4. There are outliers.
5. There is under-sampling or over-sampling.
6. The data has mixed modalities, e.g. categorical and continuous values.  
  
(The first project is about EDA on synthetic credit fraud data.)

### 3.1 The data is malformed.  
You can use a debugger and try to read the lines in one at a time. With each line, you can check that the imported values are within a standard range. If the line is malformed (e.g. a row has 6 columns and the rest of the file has 7), it’s usually best to throw out the entire row.
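
The column-count check described above can be sketched as follows. This is a minimal illustration, not a general-purpose loader: the in-memory `raw` text is made up for the example, and the names `good_rows` and `bad_lines` are hypothetical.

```python
import csv
import io

# Hypothetical raw file: the header has 3 columns, but one row has only 2.
raw = io.StringIO(
    "Race,Time,hoursStayICU\n"
    "White,11/04/2023,10\n"
    "Asian,13/04/2023\n"
    "Black,14/04/2023,40\n"
)

reader = csv.reader(raw)
header = next(reader)

good_rows, bad_lines = [], []
for line_no, row in enumerate(reader, start=2):  # line 1 is the header
    if len(row) == len(header):
        good_rows.append(row)       # well-formed: keep it
    else:
        bad_lines.append(line_no)   # malformed: throw out the entire row

print(len(good_rows), "rows kept; malformed lines:", bad_lines)
```

Recording the line numbers of dropped rows makes it easy to inspect them later and decide whether they can be repaired.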

### 3.2 Missing values
Intuitively, some of the data is simply not there. There are two broad options:  
1. Drop the samples with missing values. However, this may lead to information loss. Try it when only a few values are missing and the dataset is otherwise clean.  
2. Imputation. 
    1. Simple rules: for categorical features, you can add a new tag such as "Unknown". For continuous features there are many options, including replacing missing values with 0, the mean, or a domain-specific estimate (for example, a state's missing population can be imputed as the average population density multiplied by the area of that state).
    2. Machine learning methods: 
        1. Run K-means clustering on the data with the missing feature dimension removed. Assign the incomplete data point to a cluster, then impute with the average of that cluster.
        2. Fit a linear model that predicts the missing feature from the others. 
        3. More complicated methods, such as a GAN (Generative Adversarial Network). For example, given an image with part of it masked out, we can generate a complete one. **I may complement it later**
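
A minimal sketch of the simpler imputation options above. The small `df` (with a hypothetical `age` column) is made up for illustration and is not the notebook's template; the choice of 2 clusters is also an assumption.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical data with gaps in one categorical and one continuous column.
df = pd.DataFrame({
    "Race": ["White", None, "Asian", "Black", "Asian", None],
    "age": [30.0, 32.0, 31.0, 70.0, 72.0, 71.0],
    "hoursStayICU": [10.0, 12.0, np.nan, 60.0, np.nan, 62.0],
})

# Categorical: add an explicit "Unknown" tag.
df["Race"] = df["Race"].fillna("Unknown")

# Continuous, option 1: replace missing values with the column mean.
mean_imputed = df["hoursStayICU"].fillna(df["hoursStayICU"].mean())

# Continuous, option 2: cluster on the complete feature only ("age"),
# then fill each gap with the mean of the cluster the row belongs to.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(df[["age"]])
cluster_means = df.groupby(clusters)["hoursStayICU"].transform("mean")
cluster_imputed = df["hoursStayICU"].fillna(cluster_means)

print(pd.DataFrame({"mean": mean_imputed, "kmeans": cluster_imputed}))
```

In this toy data the column mean sits between the two groups of patients, while the cluster-based fill stays close to each row's neighbours.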

### 3.3 The dataset has different scales
Many machine learning models are sensitive to differences in scale. Some columns may record percentiles, while others record values in different units, e.g. 1 hundred in one column and 1.4 million in another. 
Normalize them: 
1. Divide by the maximum. Pro: interpretable. Con: affected by outliers.
2. Standardize with $\frac{x - \bar{x}}{\sigma_x}$. Pro: less affected by outliers. Con: less interpretable.

In [23]:
# We can compute this by hand or use StandardScaler from sklearn
scaler = StandardScaler()
standardVariable = scaler.fit_transform(temp[["hoursStayICU"]])
standardVariable

array([[-1.5],
       [-1. ],
       [-0.5],
       [ 0. ],
       [ 0.5],
       [ 1. ],
       [ 1.5]])
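
For comparison, here is a sketch of the same standardization done by hand, next to division by the maximum. The outlier value 700 is a made-up number, added only to show how max-scaling reacts to it.

```python
import numpy as np

hours = np.array([10, 20, 30, 40, 50, 60, 70], dtype=float)

# Standardization by hand. StandardScaler divides by the population standard
# deviation (ddof=0), which is also the default of ndarray.std().
z = (hours - hours.mean()) / hours.std()

# Division by the maximum keeps values on an interpretable 0-1 scale ...
max_scaled = hours / hours.max()

# ... but one outlier (hypothetical value 700) squeezes all other values.
with_outlier = np.append(hours, 700.0)
max_scaled_outlier = with_outlier / with_outlier.max()

print(z)
print(max_scaled_outlier[:7].max())
```

The hand-computed `z` matches the `StandardScaler` output shown above, and after adding the outlier every original value is squeezed into the bottom tenth of the 0-1 range.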

### 3.4 There are outliers  
There are many situations and methods for handling outliers. 
1. Drop: Depending on what the variable represents, we can directly drop data that violates known ground truth. We can also set a limit, such as a minimum, and drop data beyond it. 
2. Statistical methods: We can visualize the data with histograms or scatter plots, or use statistical measures to find influential points (such as Cook's distance or DFFITS).
3. Special attention: the z-score. A widely used rule is to cut data with $|z| > 3$, assuming the data follows a Gaussian distribution.  
4. For high-dimensional data, the z-score may not be usable. We handle this with anomaly detection. **Flag it and complement it later**.
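
The z-score rule from item 3 can be sketched as below. The `values` array is made up for illustration, and the cutoff of 3 is the conventional choice mentioned above rather than a fixed requirement.

```python
import numpy as np

# Hypothetical measurements: 20 ordinary values and one extreme value.
values = np.array(
    [10, 12, 11, 13, 9, 10, 11, 12, 10, 11,
     12, 9, 10, 11, 13, 12, 10, 11, 9, 10, 100],
    dtype=float,
)

# z-score of every point relative to the whole sample.
z = (values - values.mean()) / values.std()

# Flag points more than 3 standard deviations from the mean, then drop them.
is_outlier = np.abs(z) > 3
cleaned = values[~is_outlier]

print("flagged:", values[is_outlier])
print("kept:", len(cleaned), "of", len(values))
```

Note that with the population standard deviation, no point in a sample of $n$ values can have $|z|$ above $\sqrt{n-1}$. On a very small sample, such as the 7-row template ($\sqrt{6} \approx 2.45$), a cutoff of 3 would never flag anything.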

### 3.5 There is under-sampling or over-sampling.

### 3.6 The data has mixed modalities, e.g. categorical and continuous values.  

There are many cases. In general, a dataset may mix categorical, numerical, and time-series data. You can try the following code:  


In [14]:
print(temp.dtypes)
print("--------------------------------------------------")
# 1. Turn categorical data to one-hot encoding
oneHot = pd.get_dummies(temp['Race'], dtype=int)  # dtype=int gives 0/1 columns; newer pandas defaults to bool
print(oneHot.head(2))
print("--------------------------------------------------")
# 2. Convert the date strings to datetime; state the format explicitly because the data is DD/MM/YYYY
time = pd.to_datetime(temp["Time"], format="%d/%m/%Y")
print(time.dtype)
# 3. The losses for continuous and categorical data are different, e.g. MSE for continuous and cross-entropy
#    for categorical. This matters when the variables to predict are of different kinds. We may predict them
#    separately or combine the losses with a weight: a*L_dis + (1 - a)*L_cont

Race    object
Time    object
dtype: object
--------------------------------------------------
   Asian  Black  Declined  White
0      0      0         0      1
1      0      0         0      1
--------------------------------------------------
datetime64[ns]
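
The weighted loss `a*L_dis + (1 - a)*L_cont` mentioned in the code comments can be sketched with NumPy. All predictions below are made up, and the weight `a = 0.5` is an arbitrary assumption; in practice it would be tuned.

```python
import numpy as np

# Hypothetical continuous target and predictions.
y_cont_true = np.array([10.0, 20.0, 30.0])
y_cont_pred = np.array([12.0, 18.0, 33.0])

# Hypothetical categorical target (class indices) and predicted probabilities.
y_cat_true = np.array([0, 2, 1])
y_cat_prob = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.2, 0.5, 0.3],
])

# L_cont: mean squared error on the continuous target.
l_cont = np.mean((y_cont_true - y_cont_pred) ** 2)

# L_dis: cross-entropy, i.e. the mean negative log-probability of the true class.
true_class_prob = y_cat_prob[np.arange(len(y_cat_true)), y_cat_true]
l_dis = -np.mean(np.log(true_class_prob))

a = 0.5  # assumed weight between the two losses
total_loss = a * l_dis + (1 - a) * l_cont
print(l_cont, l_dis, total_loss)
```

Because the two losses live on different scales (here the MSE is roughly ten times the cross-entropy), the weight `a` also acts as a rescaling factor, which is one reason to standardize continuous targets first.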


