# Data mining & Machine Learning (cheat-sheet)

#### The course evaluation structure
* material (string methods, pandas, regex, web scraping)
* Exercises
* Project I (in groups of 3) - please send me an email with your team members
* material (NLP, OCR data extraction, JSON, parallel + network programming)
* Exercises
* Project II (personal or group of 2)

* we have arrived at the first project (halfway through the course)

* until now you have done some simple, but I hope interesting (and useful), exercises
* now it is time for a project
* so, today's class is to prepare you for Project I
* how much time do you need for this project (a month)? (in parallel you will also need to do some simple exercises as before)
* for exceptional results I will email the operational manager of the company (I hope he will not kill me for giving you the data)
* the project could in fact easily take a year to complete (or never be completed :) but I want you to get a feel for it.
* **focus on the data: what data you wish to have, understand the problem, organize the data and prepare it for a Machine Learning algorithm**
* I want you to work in groups for the knowledge exchange and discussion (learn something about machine learning from a colleague; some basics of ML are covered here too)
* before the problem is presented... a bit of ML theory, because you need to know what you need the data for
* the project is business related

#### importance of data and ML in business (and science too)

![](imgs/CD_for_business.png)

   <a id="progr_vs_ml"></a>
## Is data important in machine learning and vice-versa?

![](imgs/MD_ML_relationship.jpg)

https://jakevdp.github.io/PythonDataScienceHandbook/

#### it works both ways: 
  * machine learning can be used for data mining (finding relationships or the importance of a feature is also data mining: extracting important information from data)
  * data mining can be used for machine learning (machine learning needs a lot of data to work well)

# Difference between traditional programming and machine learning, and its consequence.

#### Traditional Programming

Traditional programming is a manual process: a person (the programmer) creates the program and has to formulate and code the rules by hand.

![](imgs/tradicional_programming.png)

Not much data is needed here because the relationship (program/algorithm) is known, so the output can be computed directly for each input.

#### Machine learning programming
On the other hand, in machine learning, the algorithm automatically formulates the rules from the data.

So, unlike traditional programming, machine learning is an automated process. You can increase the value of your built-in analytics in many areas, including data preparation, natural language interfaces, automatic outlier detection, recommendations, and causality and significance detection. All of these features help speed up user insights and reduce decision bias.

![](imgs/machine_learning_new.png)

A lot of data is needed here because you are looking for the relationship between input and output.

### Example. Convert Celsius to Fahrenheit using traditional and machine learning approach.
* In ***traditional programming*** you can write a function that takes input in celsius and returns output in fahrenheit using the formula $F = C * 1.8 + 32$. Note that the program was explicitly given the instructions to execute.

![](imgs/trad_C_to_F.png)

* With the ***machine learning*** approach, the input and output are known and the relationship between the input and output is learned by the model.

![](imgs/ml_C_to_F.png)
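
The two approaches can be put side by side in a few lines. This is only a minimal sketch: the temperature values are arbitrary, and `LinearRegression` from Scikit-Learn is used here as one possible model, chosen because the true relationship is a straight line.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Traditional programming: the rule is written by hand.
def celsius_to_fahrenheit(celsius):
    return celsius * 1.8 + 32

# Machine learning: only input/output pairs are given, not the rule.
celsius = np.array([-40.0, -10.0, 0.0, 8.0, 15.0, 22.0, 38.0, 100.0])
fahrenheit = celsius_to_fahrenheit(celsius)  # stands in for measured data

model = LinearRegression()
model.fit(celsius.reshape(-1, 1), fahrenheit)  # sklearn expects 2-D inputs

learned_slope = model.coef_[0]
learned_intercept = model.intercept_
print(f"learned rule: F = C * {learned_slope:.2f} + {learned_intercept:.2f}")
```

Because the example data are exact, the learned coefficients should come out very close to 1.8 and 32, even though the formula was never given to the model.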

# There are several areas of data mining and machine learning that are most common nowadays:

*    Predictive Modelling. Regression and classification algorithms for supervised learning (prediction), metrics for evaluating model performance.
*    Clustering. Methods to group data without a label into clusters: K-Means, selecting cluster numbers based on objective metrics.
*    Dimensionality Reduction. Methods to reduce the dimensionality of data and attributes of those methods: PCA and LDA.
*    Feature Importance. Methods to find the most important feature in a dataset: permutation importance, SHAP values, Partial Dependence Plots.
*    Data Transformation. Methods to transform the data for greater predictive power, for easier analysis, or to uncover hidden relationships and patterns: standardization, normalization, box-cox transformations.

# Most common algorithms of ML:

![](imgs/ML_algorithms.png)

* This project will be related to a supervised regression algorithm

# Independently of what algorithm we use, the ML needs data!!!
* look again at the Celsius to Fahrenheit conversion
* so the more (correct) data you give to the ML algorithm, the more precise the coefficients it will find
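
A rough way to see this effect is to fit the Celsius to Fahrenheit relationship on noisy readings of different sizes. The sketch below is an illustration under assumptions: the noise level (2 degrees), the temperature range and the sample sizes are all made up.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

def fit_from_noisy_samples(n_samples):
    """Fit F = a*C + b on n_samples noisy thermometer readings."""
    celsius = rng.uniform(-40, 100, size=n_samples)
    noise = rng.normal(0, 2.0, size=n_samples)  # assumed measurement error
    fahrenheit = celsius * 1.8 + 32 + noise
    model = LinearRegression().fit(celsius.reshape(-1, 1), fahrenheit)
    return model.coef_[0], model.intercept_

# Compare the coefficients found with little, some, and a lot of data.
results = {n: fit_from_noisy_samples(n) for n in (5, 50, 5000)}
for n, (slope, intercept) in results.items():
    print(f"n={n:>5}: a={slope:.3f}, b={intercept:.3f}")
```

With only a handful of readings the coefficients can drift noticeably from 1.8 and 32; with thousands of readings they are typically much closer.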

#### Data is the most important and is a "must-have" food for machine learning. 
It can be any fact, text, symbols, images, videos, etc., but in unprocessed form. When the data is processed, it is known as information. Machine learning without data is nothing but a bare machine with no soul and no mind. Data is what lets machines perform tasks that were unthinkable a few years ago.

![](imgs/ML_performance.png)

* guess the person I have in mind
* male/female
* approximate location in the classroom
* height
* hair color
* clothes  
... with more information it is easier to guess

#### To “teach the machine”...
you need information. For example, if you would like to train a neural network to predict the winner of a football match, you can’t merely look at who won the game last year. You will want a lot of information, as much as you can get.
* You want every stat for every player, ideally for his or her entire career,
* the place where the game will be played,
* the altitude, etc. The more information you have, the more the neural network can learn from it.

Thus, essentially, data mining and data processing are among the earliest steps toward machine learning. You mine the data, then organize it, normalize it, etc.

# Two-step process: the Data Mining -> Machine Learning workflow

### Part I Preparing data for machine learning:
* close the computer and think
* collect the data!!! This is one of the most important steps. Make sure the data are complete and that no feature that influences the target (for example, the price of a house) is missing.
* Check that the data have no errors: typos, or extraordinarily large or small values (outliers).
* Check that rows are not repeated.
* Check that columns are not repeated.
* Check that there are no empty values in the columns.
* Check that there is no strong correlation between columns (if there is a column with the size of the house in m^2 and another with the size in ft^2, then one is redundant).
* Identify which variables are categorical and which are continuous.
* scale variables to have the same order of magnitude
* convert categorical values to numerical (label encoding or one-hot encoding)
* Finally, none of these operations can be performed blindfolded: to understand the data you must visualize it, sort it and group it.
* close the computer and think
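
Several items from this checklist map directly onto pandas calls. The sketch below uses a small invented house table (all column names and values are hypothetical) and covers only duplicates, empty values, outliers and redundant columns. The 1.5 * IQR rule for outliers and the 0.99 correlation threshold are common conventions chosen here for illustration, not the only options.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data with the typical problems from the checklist.
raw = pd.DataFrame({
    "size_m2":  [50.0, 80.0, 80.0, 120.0, np.nan, 95.0, 60.0],
    "size_ft2": [538.2, 861.1, 861.1, 1291.7, 1076.4, 1022.6, 645.8],
    "rooms":    [2, 3, 3, 4, 4, 3, 2],
    "price":    [100_000, 160_000, 160_000, 250_000, 210_000, 9_000_000, 125_000],
})

clean = raw.drop_duplicates()   # repeated rows
clean = clean.dropna()          # rows with empty values

# Outliers: keep prices within 1.5 * IQR of the quartiles.
q1, q3 = clean["price"].quantile([0.25, 0.75])
iqr = q3 - q1
clean = clean[clean["price"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

# Redundant columns: size_m2 and size_ft2 carry the same information.
corr = clean[["size_m2", "size_ft2"]].corr().iloc[0, 1]
if corr > 0.99:
    clean = clean.drop(columns="size_ft2")

print(clean)
```

Here the duplicated row, the row with a missing size and the 9,000,000 price are removed, and the ft^2 column is dropped as redundant. In a real project each of these decisions should be checked against a plot of the data rather than applied automatically.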

### An example of the Part I analysis is in the folder

### Collecting the data is the key part
Let's imagine you want to predict the price of a house (I know we have seen this hundreds of times, but it's a good, very intuitive example :). What information (inputs - `features`) would you like to have in order to predict the price of a house (output - `target`)?
* how mnownrs

## All of this is basically Pandas, presented as an example in this folder


### one thing that is not presented well is converting categorical variables into numerical ones
#### machine learning prefers numerical data over other types of data (like strings, etc.)

![](imgs/data_for_machine.png)

### it is often useful to convert non-numerical data into numerical data
#### remember converting A(automatic) / M(manual) into 1/0 or 0/1
#### btw, was it surprising that M consumes less gasoline than A?
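
As a reminder, a two-valued column like this can be converted with a simple mapping. The table below is hypothetical (invented column names and consumption values), just to show the pandas call.

```python
import pandas as pd

# Hypothetical car data; the column names are made up for illustration.
cars = pd.DataFrame({
    "transmission": ["A", "M", "M", "A"],
    "consumption_l_per_100km": [8.1, 6.9, 7.2, 8.4],
})

# Map each category to a number: automatic -> 1, manual -> 0.
cars["transmission_num"] = cars["transmission"].map({"A": 1, "M": 0})
print(cars)
```

Which category gets 1 and which gets 0 does not matter for a binary column, as long as the choice is applied consistently.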

# One-hot encoding

![](imgs/one_hot_encoding.png)

```
link = 'https://raw.githubusercontent.com/mhemmg/datasets/master/drugset/drug200.csv'
import pandas as pd
my_data = pd.read_csv(link, delimiter=",")
my_data.head(3)
```

```
link = 'https://raw.githubusercontent.com/mhemmg/datasets/master/drugset/drug200.csv'
import pandas as pd
my_data = pd.read_csv(link, delimiter=",")

df_dummy = pd.get_dummies(my_data.Sex, prefix='Sex')
df_concat = pd.concat([my_data, df_dummy], axis=1)
df_concat.drop(['Sex'], axis=1, inplace=True)
df_concat.head(3)
```

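|   | Age | BP | Cholesterol | Na_to_K | Drug | Sex_F | Sex_M |
|---|-----|----|-------------|---------|------|-------|-------|
| 0 | 23 | HIGH | HIGH | 25.355 | drugY | 1 | 0 |
| 1 | 47 | LOW | HIGH | 13.093 | drugC | 0 | 1 |
| 2 | 47 | LOW | HIGH | 10.114 | drugC | 0 | 1 |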


* question: imagine you're predicting the price of a house using a large dataset... can you convert the number of bedrooms into one-hot encoding?

# Label encoding

![](imgs/label_encoding.png)

```
link = 'https://raw.githubusercontent.com/mhemmg/datasets/master/drugset/drug200.csv'
import pandas as pd
my_data = pd.read_csv(link, delimiter=",")

from sklearn.preprocessing import LabelEncoder
labelencoder = LabelEncoder()
my_data['Sex_Cat'] = labelencoder.fit_transform(my_data['Sex'])
my_data.drop(['Sex'], axis=1, inplace=True)
my_data.head(3)
```

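|   | Age | BP | Cholesterol | Na_to_K | Drug | Sex_Cat |
|---|-----|----|-------------|---------|------|---------|
| 0 | 23 | HIGH | HIGH | 25.355 | drugY | 0 |
| 1 | 47 | LOW | HIGH | 13.093 | drugC | 1 |
| 2 | 47 | LOW | HIGH | 10.114 | drugC | 1 |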


# But do you think machine learning prefers one-hot encoding or label encoding??
answer:
<font color='white'>The problem with label encoding is that, since there are different numbers in the same column, the model may misread the data as having some kind of order, 0 < 1.

The model may derive a correlation such as "as the country number increases, the population increases", which clearly need not hold in other data or in the prediction set.
 It seems it's better to use one-hot encoding for categories without a natural order.</font>
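
The difference is easier to see with more than two categories. The sketch below uses an invented `country` column (the values are hypothetical) and compares Scikit-Learn's `LabelEncoder` with `pd.get_dummies`.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical column with three unordered categories.
df = pd.DataFrame({"country": ["Japan", "France", "Peru", "France"]})

# Label encoding: one integer per category, assigned in alphabetical order.
encoder = LabelEncoder()
df["country_label"] = encoder.fit_transform(df["country"])
# A model may read France (0) < Japan (1) < Peru (2) as a real ordering.

# One-hot encoding: one column per category, no ordering implied.
one_hot = pd.get_dummies(df["country"], prefix="country", dtype=int)
print(pd.concat([df, one_hot], axis=1))
```

The integers produced by label encoding depend only on the alphabetical order of the names, so any "ordering" the model sees is accidental. The price of one-hot encoding is one extra column per category.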

#### Important
In machine learning the input data is usually stored in a variable called `X_data` and the output data in `y_data`. Given `X_data` and `y_data`, the machine learning algorithm tries to find the relationship between the input and the output; this is the "learning" in machine learning. That is why data (quality and quantity) is so important in this discipline.

# real life example: a bank wants to predict, based on historical data, whether a customer will pay his loan back or not. How is the data organized, what is the input and what is the output? (the ML will try to find the correlation between those two)

![](imgs/machine_learning_data_input_output.png)

* X_data=input
* y_data=output
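
In pandas this separation is usually a single `drop` plus a column selection. The loan table below is invented for illustration (column names and values are assumptions, not the bank's real data).

```python
import pandas as pd

# Hypothetical loan history; column names and values are invented.
loans = pd.DataFrame({
    "age":         [25, 41, 33, 52],
    "income":      [28_000, 61_000, 45_000, 39_000],
    "loan_amount": [10_000, 25_000, 8_000, 30_000],
    "paid_back":   [1, 1, 1, 0],   # 1 = loan repaid, 0 = default
})

X_data = loans.drop(columns="paid_back")  # inputs (features)
y_data = loans["paid_back"]               # output (target)

print(X_data.shape, y_data.shape)
```

Each row of `X_data` describes one customer, and the matching entry of `y_data` says what actually happened with that customer's loan.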

# Part II: Machine Learning (with Scikit-Learn for now, no NN)

## Once the data is ready we do the following steps:
* split the data into train / test set
* standardize / normalize the data
* modeling (fitting the data)
* testing the model's performance

### split the data into train / test set
* to really test how well the model has been trained, we have to ask it to predict values based on inputs that it DID NOT SEE during training.
* so to train the ML model we use only the larger part of the data, called X_train and y_train (the features and the targets); the remaining part, X_test and y_test, is held back for testing

![](imgs/test_train_split.png)

```
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
```

### standardize / normalize the data

* the machine learning algorithm searches for coefficients (one per feature), and the coefficients it finds must produce an output that is as close to the target as possible
* it's like linear regression $y = a*x + b$ (where x is, for example, the size of the house and y is its price)
* this kind of problem is called a minimization problem: the ML algorithm tries to find a and b such that $a*x + b$ is as close to y as possible
* but if the problem is multidimensional (size of the house and number of rooms), the minimization has to deal with features on very different scales (the size of a house in $ft^2$ is $\sim 10000$, while the number of rooms is $\sim 5$); scaling brings them to the same order of magnitude

![](imgs/gd_asymmetry.png)

```
from sklearn.preprocessing import StandardScaler
scaler=StandardScaler()
X_train=scaler.fit_transform(X_train)
X_test=scaler.transform(X_test)
```
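
To see what the scaler actually does, it helps to run it on a tiny example. The numbers below are invented (house size in ft^2 and number of rooms); the point is that the scaler learns the mean and standard deviation from the train set only and reuses them for the test set.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical features: house size in ft^2 and number of rooms.
X_train = np.array([[8000.0, 3.0], [10000.0, 5.0], [12000.0, 7.0]])
X_test = np.array([[11000.0, 4.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std on train only
X_test_scaled = scaler.transform(X_test)        # reuse the train statistics

print("train means:", X_train_scaled.mean(axis=0))  # ~0 for every column
print("train stds: ", X_train_scaled.std(axis=0))   # ~1 for every column
print("test row:   ", X_test_scaled)
```

After scaling, both columns have the same order of magnitude, so neither feature dominates the minimization just because of its units.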

### modeling (fitting the data)

Available linear models: https://scikit-learn.org/stable/modules/classes.html#module-sklearn.linear_model

![](imgs/sklearn_fit.png)
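
The fit/predict pattern looks the same for almost every Scikit-Learn model. The sketch below uses `LinearRegression` on synthetic house data; the data-generating formula, the noise level and the variable names are assumptions made only for this illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption): price depends on size and number of rooms.
rng = np.random.default_rng(42)
size = rng.uniform(50, 200, size=200)      # m^2
rooms = rng.integers(1, 6, size=200)
price = 1500 * size + 10_000 * rooms + rng.normal(0, 5_000, size=200)

X = np.column_stack([size, rooms])
y = price
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

model = LinearRegression()
model.fit(X_train, y_train)      # learn the coefficients from the train set
y_pred = model.predict(X_test)   # predict on data the model has not seen

print("coefficients:", model.coef_, "intercept:", model.intercept_)
```

The learned coefficient for the size should land near the 1500 used to generate the data. Swapping `LinearRegression` for another regressor usually requires changing only the line that creates `model`.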

### testing model's performance

```
print(f'R2: {model.score(X_test, y_test):.2f}')
```
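
For regressors, `model.score` returns the coefficient of determination $R^2 = 1 - SS_{res}/SS_{tot}$. As a small check of what that number means, the sketch below computes it by hand on invented values and compares it with Scikit-Learn's `r2_score`.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical true prices and model predictions.
y_test = np.array([100.0, 150.0, 200.0, 250.0])
y_pred = np.array([110.0, 140.0, 210.0, 240.0])

ss_res = np.sum((y_test - y_pred) ** 2)          # unexplained variation
ss_tot = np.sum((y_test - y_test.mean()) ** 2)   # total variation
r2_manual = 1 - ss_res / ss_tot

print(f"manual R2:  {r2_manual:.3f}")
print(f"sklearn R2: {r2_score(y_test, y_pred):.3f}")
```

Both lines print 0.968 here. A value of 1 means perfect predictions, and a value of 0 means the model does no better than always predicting the mean of `y_test`.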

# A full example of a machine learning implementation (house price prediction based on various characteristics) is in the class folder  

1. EDA I (Exploratory Data Analysis I)
  * exploring and cleaning the data (empty values and NaN)
  * visualization techniques
2. EDA II (Exploratory Data Analysis II)
  * removing outliers
  * checking the correlation
3. Machine Learning example with Scikit-Learn
  * more features provide a better prediction
  * interpretation of the R2 results
  * importance of features
  
(I'm not saying that for the following project you need to do what was done for the price prediction; this is just for reference)
* it's a lot of material, so you are divided into groups (with at least one person who knows ML), and the idea is that those who know more ML can share and explain some concepts to their colleagues

# Please review the example above by yourself and don't hesitate to contact me if you have any doubts (this course is not about machine learning, so you have every right to be confused by this material)
- let's look at only one detail in the Scikit-Learn implementation.