# <font color='#D3A550'>Supervised Learning with Scikit-Learn</font>
##### Source: DataCamp, W3Schools, GeeksforGeeks, and ChatGPT

## 1) Machine Learning:
The art and science of giving computers the ability to learn to make decisions from data, e.g. predicting whether an email is spam or not.

1) Supervised learning: uses labeled data
    - Classification: the target variable consists of categories, e.g. YES/NO, 1/0, R/G/B
    - Regression: the target variable is continuous, e.g. house prices
2) Unsupervised learning: uses unlabeled data

## 2) Important naming conventions:

Features = predictor variables = independent variables, also known as the feature columns/dimensions

Target = dependent variable = response variable, also known as output

## 3) Libraries for Machine Learning:
Scikit-Learn, TensorFlow, Keras

## 4) Exploratory data analysis or EDA
- Numerical EDA:
    - Calculating basic statistics such as mean, median, mode, standard deviation, and quartiles for each numerical feature
    - Calculating the correlation matrix of all numerical features to identify potential correlation among features
    - Checking for outliers using box plots, scatter plots, and z-scores

- Visual EDA:
    - Histograms to visualize the distribution of individual numerical features
    - Scatter plots to visualize the relationship between two numerical features
    - Bar plots and pie charts to visualize the distribution of categorical features
    - Pair plots and heat maps to visualize the relationship between multiple numerical features
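
The numerical EDA steps listed above can be sketched with pandas. This is only an illustration: the iris dataset and the |z| > 3 outlier threshold are assumptions chosen for the example, not something prescribed by these notes.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Load a small built-in dataset as a DataFrame (iris is an illustrative choice)
iris = load_iris(as_frame=True)
df = iris.frame                      # four numeric feature columns plus 'target'
features = df.drop(columns="target")

# Basic statistics: count, mean, std, min, quartiles, max for each feature
summary = features.describe()

# Correlation matrix of the numerical features
corr = features.corr()

# Flag potential outliers: rows where any feature has |z-score| > 3
# (the threshold of 3 is a common rule of thumb, assumed here)
z_scores = (features - features.mean()) / features.std()
outlier_rows = (z_scores.abs() > 3).any(axis=1)

print(summary.loc[["mean", "50%", "std"]])
print(corr.round(2))
print("Rows flagged as outliers:", int(outlier_rows.sum()))
```

For the visual side, calls such as features.hist() or features.plot.scatter(...) build on the same DataFrame, provided matplotlib is installed.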

## 5) .fit() VS .predict()

.fit()
- trains a model on a given dataset
- takes 2 arguments, (X_train, y_train), where X_train holds the features and y_train is the output/target variable

.predict()
- makes predictions on new data using the trained model
- takes 1 argument, the new data to predict on, e.g. .predict(X_test)
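
A minimal sketch of the two calls, assuming a tiny made-up dataset that follows y = 2x + 1 and using LinearRegression purely as a stand-in estimator (any scikit-learn estimator exposes the same two methods):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy training data following y = 2x + 1 (made up for illustration)
X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
y_train = np.array([1.0, 3.0, 5.0, 7.0])

model = LinearRegression()
model.fit(X_train, y_train)       # .fit() needs both features and targets

X_new = np.array([[4.0], [5.0]])
y_pred = model.predict(X_new)     # .predict() needs only features
print(y_pred)
```

Because the toy data is perfectly linear, the predictions for 4 and 5 come out at (approximately) 9 and 11.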

## 6) kNN or k-Nearest Neighbors: one of the many supervised learning algorithms
- Predicts the label of a data point by looking at the 'k' closest labeled data points and, for classification, taking a majority vote among them
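
To make the idea concrete, here is a from-scratch sketch of kNN classification. The helper name knn_predict, the Euclidean distance metric, and the toy points are all assumptions for illustration; scikit-learn's KNeighborsClassifier is what you would use in practice.

```python
from collections import Counter
import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    """Predict the label of x_new by majority vote among its k nearest neighbors."""
    # Euclidean distance from x_new to every training point
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # Majority vote among their labels
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

# Two well-separated toy clusters (hypothetical data)
X_train = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0],
                    [8.0, 8.0], [9.0, 8.5], [8.5, 9.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_train, y_train, np.array([1.2, 1.4])))  # → 0
print(knn_predict(X_train, y_train, np.array([8.7, 8.2])))  # → 1
```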

## 7) Train and Test
- Train/Test Split: in machine learning, data is usually split into 2 parts: a training set (used to train a model) and a test set (used to evaluate its performance)

Original dataset:

| id | X1 | X2 | Y  |
|----|----|----|----|
| 1  | 3  | 4  | 1  |
| 2  | 5  | 2  | 0  |
| 3  | 8  | 6  | 1  |
| 4  | 1  | 9  | 0  |
| 5  | 4  | 2  | 1  |

Training set:

| id | X1 | X2 | Y  |
|----|----|----|----|
| 1  | 3  | 4  | 1  |
| 3  | 8  | 6  | 1  |
| 4  | 1  | 9  | 0  |

Test set:

| id | X1 | X2 | Y  |
|----|----|----|----|
| 2  | 5  | 2  | 0  |
| 5  | 4  | 2  | 1  |

Here, the training set is used to train a model and the test set is used to evaluate its performance.
The model uses the input features X1 and X2 of the training set, together with the known output Y, to learn the relationship between the input and output variables.
It then uses the input features of the test set to predict the output variable, and those predictions are compared with the actual output.
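
The split above can be sketched with train_test_split on the same five-row table. The test_size=0.4 and random_state=0 values are assumptions for the example; which rows land in each set depends on the shuffle, so they need not match the tables shown.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# The five-row example dataset from the table above
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "X1": [3, 5, 8, 1, 4],
    "X2": [4, 2, 6, 9, 2],
    "Y":  [1, 0, 1, 0, 1],
})
X = df[["X1", "X2"]]   # features
y = df["Y"]            # target

# 40% of 5 rows -> 2 test rows, leaving 3 training rows
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0
)
print(len(X_train), len(X_test))
print(sorted(X_train.index), sorted(X_test.index))
```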

## 8) e.g. kNN using the digits dataset in sklearn

### Original dataset (digits):

| id | X1 | X2 | X3 | ... | X64 | Y  |
|----|----|----|----|-----|-----|----|
| 1  | 0  | 1  | 5  | ... | 2   | 3  |
| 2  | 4  | 0  | 0  | ... | 7   | 2  |
| 3  | 3  | 7  | 9  | ... | 1   | 4  |
| 4  | 1  | 0  | 2  | ... | 8   | 1  |
| 5  | 2  | 8  | 8  | ... | 6   | 9  |

Training set:

| id | X1 | X2 | X3 | ... | X64 | Y  |
|----|----|----|----|-----|-----|----|
| 1  | 0  | 1  | 5  | ... | 2   | 3  |
| 3  | 3  | 7  | 9  | ... | 1   | 4  |
| 4  | 1  | 0  | 2  | ... | 8   | 1  |
| 5  | 2  | 8  | 8  | ... | 6   | 9  |

Test set:

| id | X1 | X2 | X3 | ... | X64 | Y  |
|----|----|----|----|-----|-----|----|
| 2  | 4  | 0  | 0  | ... | 7   | 2  |

### Random state and Stratify
random_state: If you set random_state=42 (or any other fixed integer), the function always splits the data
in the same way: if you run the script multiple times,
you get the same training and test sets.

stratify: Say you have a dataset with 90% of samples belonging to class 1 and 10% of samples
belonging to class 2. Passing stratify=y ensures that the same proportions (90% and 10%)
are maintained in both the training and test sets.
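
A small check of both parameters, assuming a synthetic 90/10 dataset built just for this illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 synthetic samples: 90 of class 1 and 10 of class 2
X = np.arange(100).reshape(-1, 1)
y = np.array([1] * 90 + [2] * 10)

# Two calls with the same random_state should give the same split
split_a = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
split_b = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
X_train, X_test, y_train, y_test = split_a

same_split = np.array_equal(split_a[1], split_b[1])
print("Same test set on both runs:", same_split)

# With stratify=y, class 2 makes up 10% of both the training and the test set
print("Share of class 2 in train:", np.mean(y_train == 2))
print("Share of class 2 in test: ", np.mean(y_test == 2))
```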

In [23]:
# Import necessary modules
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# load the digits dataset
digits = load_digits()
# Create feature and target arrays
X = digits.data
y = digits.target

# Split into training and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=42, stratify=y)

# Create a k-NN classifier with 7 neighbors: knn
knn = KNeighborsClassifier(n_neighbors=7)

# Fit the classifier to the training data
knn.fit(X_train, y_train)

# Print the accuracy
print(knn.score(X_test, y_test))

0.9833333333333333




## 9) Linear Regression

Linear regression is a machine learning algorithm that predicts a continuous output value from input features by finding the best-fit line through the data points, i.e. the line (or, with several features, the hyperplane) that minimizes the squared error.

|                             | KNN                                      | Linear Regression                    |
|-----------------------------|------------------------------------------------|--------------------------------------------------|
| Method                     | Non-parametric                              | Parametric (assumes linearity)            |
| Prediction Type            | Majority class among k nearest examples | Equation of a straight line (y=mx+b)  |
| Type of Label              | Discrete                                      | Continuous                                |
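
As a sketch of what "best-fit line" means for y = mx + b, the snippet below solves the least-squares problem directly with NumPy and compares the result with LinearRegression. The synthetic data (true slope 3, intercept 2, Gaussian noise) is an assumption made up for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data around the line y = 3x + 2 (illustrative values)
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 3.0 * x + 2.0 + rng.normal(scale=1.0, size=x.size)

# Least squares by hand: find [m, b] minimizing ||A @ [m, b] - y||^2
A = np.column_stack([x, np.ones_like(x)])
(m, b), *_ = np.linalg.lstsq(A, y, rcond=None)

# The same fit through scikit-learn
reg = LinearRegression().fit(x.reshape(-1, 1), y)

print("lstsq:   m =", m, " b =", b)
print("sklearn: m =", reg.coef_[0], " b =", reg.intercept_)
```

Both approaches minimize the same squared error, so the slope and intercept should agree up to floating-point precision.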




In [24]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Read in the data
df = pd.read_csv('datasets/gm_2008_region.csv')  # forward slash avoids the invalid '\g' escape and works on every OS

# One-hot encode the 'Region' column
df_encoded = pd.get_dummies(df, columns=['Region'], prefix='Region')

# Assign the encoded Dataframe to X
X = df_encoded.drop(['life'], axis=1)
y = df_encoded['life']

# Create training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create the regressor
reg_all = LinearRegression()

# Fit the regressor to the training data
reg_all.fit(X_train, y_train)

# Predict on the test data
y_pred = reg_all.predict(X_test)

# Compute and print R^2 and RMSE
print("R^2: {}".format(r2_score(y_test, y_pred)))
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print("Root Mean Squared Error: {}".format(rmse))


R^2: 0.8219419939586892
Root Mean Squared Error: 3.4052481157341434


## 10) Cross validation technique

Cross-validation is a technique used to evaluate the performance of a machine learning model by training the model on different subsets of the data and evaluating it on the remaining parts. The most common one is k-fold cross-validation. In k-fold cross-validation, the data is divided into k subsets, or "folds", and the model is trained on k-1 of the folds and evaluated on the remaining one. This process is repeated k times, with a different fold being used as the test set each time.

cross_val_score() is a function from the sklearn.model_selection module that allows you to easily perform k-fold cross-validation on a model. 
The function takes the following arguments:
- estimator: the model you want to evaluate, e.g. LinearRegression()
- X: the feature data
- y: the target data
- cv: the number of folds to use (k)

Note that cross_val_score takes care of splitting the data into training and test sets for each fold, fitting the model to the training data, and evaluating it on the test data. It returns one score per fold; for a regressor such as LinearRegression the default score is R^2.
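
To show what happens under the hood, here is a sketch that runs 3-fold cross-validation by hand with KFold and checks it against cross_val_score. The synthetic regression data from make_regression is an assumption for the example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data (illustrative only)
X, y = make_regression(n_samples=90, n_features=3, noise=10.0, random_state=0)

kf = KFold(n_splits=3)
manual_scores = []
for train_idx, test_idx in kf.split(X):
    model = LinearRegression()
    model.fit(X[train_idx], y[train_idx])                         # train on k-1 folds
    manual_scores.append(model.score(X[test_idx], y[test_idx]))   # R^2 on the held-out fold

# cross_val_score performs the same loop internally
auto_scores = cross_val_score(LinearRegression(), X, y, cv=KFold(n_splits=3))

print(manual_scores)
print(auto_scores)
print("Identical:", np.allclose(manual_scores, auto_scores))
```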

In [25]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Read in the data
df = pd.read_csv('datasets/gm_2008_region.csv')

# One-hot encode the 'Region' column
df_encoded = pd.get_dummies(df, columns=['Region'], prefix='Region')

# Assign the encoded Dataframe to X
X = df_encoded.drop(['life'], axis=1)
y = df_encoded['life']

# Create the regressor
reg_all = LinearRegression()

# Perform 3-fold CV
cvscores_3 = cross_val_score(reg_all, X, y, cv=3)
print(np.mean(cvscores_3))

0.858814795800147


## 11) Hyperparameter Tuning

Hyperparameter tuning is the process of searching for the best combination of hyperparameters for a machine learning model, so that it achieves the best performance on unseen data.
Examples of hyperparameters include the number of nearest neighbors in KNN (n_neighbors) and the inverse regularization strength in logistic regression (C).

GridSearchCV: Popular method for hyperparameter tuning in scikit-learn. It's used to search for the best combination of hyperparameters by training the model on different combinations of the hyperparameters and evaluating its performance using cross-validation.
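
Conceptually, a grid search is just a loop over candidate values with cross-validation inside. The sketch below does this by hand for the n_neighbors hyperparameter of KNN on the digits dataset; the candidate list and cv=5 are assumptions chosen for the example, and GridSearchCV automates this same procedure.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)

# Candidate values for the hyperparameter (an illustrative grid)
candidates = [1, 3, 5, 7, 9]

mean_scores = {}
for k in candidates:
    knn = KNeighborsClassifier(n_neighbors=k)
    # Mean accuracy over 5 cross-validation folds for this candidate
    mean_scores[k] = cross_val_score(knn, X, y, cv=5).mean()

# Keep the candidate with the highest mean cross-validated accuracy
best_k = max(mean_scores, key=mean_scores.get)
print(mean_scores)
print("Best n_neighbors:", best_k)
```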

### GridSearchCV

In [31]:
# Example of hyperparameter tuning with logistic regression on the diabetes dataset using GridSearchCV
# Parameters for LogisticRegression: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

df = pd.read_csv('datasets/diabetes.csv')

# Separate the features (X) from the target column (y)
X = df.drop('Outcome', axis=1)
y = df['Outcome']

# Setup the hyperparameter grid
c_space = np.logspace(-5, 8, 15)
param_grid = {'C': c_space}

# Instantiate a logistic regression classifier: logreg
logreg = LogisticRegression(solver='lbfgs', max_iter=1000)

# Instantiate the GridSearchCV object: logreg_cv
logreg_cv = GridSearchCV(logreg, param_grid, cv=5)

# Fit it to the data
logreg_cv.fit(X, y)

# Print the tuned parameters and score
print("Tuned Logistic Regression Parameters: {}".format(logreg_cv.best_params_)) 
print("Best score is {}".format(logreg_cv.best_score_))

Tuned Logistic Regression Parameters: {'C': 0.006105402296585327}
Best score is 0.7734742381801205


### RandomizedSearchCV

In [33]:
# Hyperparameter tuning with RandomizedSearchCV on the diabetes dataset
# (reuses X and y from the GridSearchCV cell above)

# Import necessary modules
from scipy.stats import randint
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import RandomizedSearchCV

# Setup the parameters and distributions to sample from: param_dist
param_dist = {"max_depth": [3, None],
              "max_features": randint(1, 9),
              "min_samples_leaf": randint(1, 9),
              "criterion": ["gini", "entropy"]}

# Instantiate a Decision Tree classifier: tree
tree = DecisionTreeClassifier()

# Instantiate the RandomizedSearchCV object: tree_cv
tree_cv = RandomizedSearchCV(tree, param_dist, cv=5)

# Fit it to the data
tree_cv.fit(X, y)

# Print the tuned parameters and score
print("Tuned Decision Tree Parameters: {}".format(tree_cv.best_params_))
print("Best score is {}".format(tree_cv.best_score_))

Tuned Decision Tree Parameters: {'criterion': 'gini', 'max_depth': 3, 'max_features': 4, 'min_samples_leaf': 1}
Best score is 0.7448094389270861


# Basics of Machine Learning Pipeline

<font color='#C0B3A0'>As we've seen, all of these models follow a similar sequence of steps to make predictions. Below is a general overview of how a basic machine learning pipeline works:
1) Data Loading: Load the dataset from a file such as CSV, JSON, or TXT.

2) Data Cleaning and Preprocessing: You can clean and preprocess the data to make it ready for analysis. This can include tasks such as handling missing values, removing outliers, converting data types, normalizing data, and more.

3) Exploratory Data Analysis (EDA): Start by performing EDA on the data, and looking for patterns and relationships between the different variables. Create visualizations, such as histograms, box plots, scatter plots, and heat maps, to better understand the distribution of the data and identify any outliers. This will give you a better understanding of the data and help you identify any potential problems or issues.

4) Data visualization: Create data visualizations to represent patterns and relationships in the data. Use different types of charts to represent different aspects of the data, such as bar charts for categorical data and scatter plots for continuous data. You can also create interactive visualizations using libraries like plotly, bokeh, etc.

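5) Feature Selection: Use techniques such as linear discriminant analysis (LDA) and principal component analysis (PCA) to reduce the number of input dimensions before making predictions. Strictly speaking, these are dimensionality-reduction methods that build new features from combinations of the original ones, rather than selecting a subset of the existing variables.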

6) Correlation Analysis: Perform correlation analysis to find the correlation between the different variables and the target column. This will help you identify which variables are the most important predictors of the target.

7) Clustering: This step falls under the category of unsupervised learning and data exploration. It's used to group similar data points together; hierarchical clustering, for example, does so by building a tree-like structure called a dendrogram.

8) Predictive Modeling: Use the data to train machine learning models to predict one of the columns. You can try different models, such as linear regression, decision trees, random forests, and neural networks, and compare their performance.

</font>
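
The loading, preprocessing, and modeling steps above can be chained with scikit-learn's Pipeline. The following is a minimal sketch under stated assumptions: the built-in breast cancer dataset, StandardScaler for preprocessing, and LogisticRegression as the model are illustrative choices, not part of the original notes.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# 1) Data loading (a built-in dataset stands in for a CSV file here)
X, y = load_breast_cancer(return_X_y=True)

# Hold out a test set to evaluate on unseen data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# 2) Preprocessing and 8) predictive modeling chained into one estimator
pipe = Pipeline([
    ("scaler", StandardScaler()),                   # normalize the features
    ("model", LogisticRegression(max_iter=1000)),   # classifier
])

# Fitting the pipeline fits the scaler on the training data only,
# then trains the model on the scaled features
pipe.fit(X_train, y_train)

test_accuracy = pipe.score(X_test, y_test)
print("Test accuracy:", test_accuracy)
```

Wrapping the steps in a Pipeline keeps the scaler from seeing the test data during fitting, and the whole object can be passed to cross_val_score or GridSearchCV like any other estimator.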

