# Machine Learning

__Machine learning__ is a method of _data analysis_ that automates
_analytical model building_.
It is a branch of _artificial intelligence_ based on the idea that systems
can _learn from data_, _identify patterns_ and _make decisions_ with
minimal human intervention.

## Scikit-learn (sklearn, SciPy, NumPy)

__Scikit-learn__ is a _Python package_ that provides a wide range of _machine learning algorithms_ and tools. 
It is built on top of _NumPy_, _SciPy_, and _Matplotlib_, and is designed to be simple and efficient for data analysis and modeling.

__Scikit-learn__ offers various modules for tasks such as _classification_, _regression_, _clustering_, _dimensionality reduction_, and _model selection_.
It also provides utilities for _preprocessing data_, _evaluating models_, and _handling datasets_.

With its extensive documentation and user-friendly interface, __Scikit-learn__ is widely used in the field of machine learning and data science.

In [None]:
#!pip install scikit-learn
import numpy as np
import pandas as pd
import sklearn

In [None]:
# Splitting the data into training and testing sets
from sklearn.model_selection import train_test_split

# Loading the dataset
football_df = pd.read_csv('csv-files/football_data.csv')

# Assuming football_df is your DataFrame and it has a 'target' column
# columns_list is a list of column names to be used as features
columns_list = ['feature1', 'feature2', 'feature3']  # Replace with actual column names

# Splitting the data into features and target
X = football_df[columns_list]
y = football_df['target']

# Splitting the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


### K-Nearest Neighbors

__K-Nearest Neighbors__ is a simple algorithm that _stores all available
cases_ and _classifies_ a new case by a majority vote among the $k$ stored
cases most similar to it, where similarity is defined by a distance measure.

It is a type of _instance-based learning_, or _lazy learning_, where the
function is only approximated locally and all computation is deferred
until a prediction is requested.
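
The voting idea can be sketched by hand with NumPy. This is a minimal illustration rather than how scikit-learn implements it: the helper `knn_predict`, the toy arrays, and the choice of Euclidean distance are all assumptions made for the example.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training points."""
    # Euclidean distance from x_new to every stored training point
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k closest points
    nearest = np.argsort(distances)[:k]
    # Majority vote over the neighbours' labels
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two well-separated clusters
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                  [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

pred_a = knn_predict(X_toy, y_toy, np.array([0.5, 0.5]))
pred_b = knn_predict(X_toy, y_toy, np.array([5.5, 5.5]))
print(pred_a, pred_b)
```

Note that nothing is "trained" here: the function only looks up stored points at prediction time, which is what makes the method lazy.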

In [None]:
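# Classification of the data
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Feature scaling for better performance of KNN
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Creating the KNN model
knn = KNeighborsClassifier(n_neighbors=5)  # You can adjust the number of neighbors
# Fitting the model with the training data
knn.fit(X_train_scaled, y_train)

# Predicting on the test set and measuring accuracy
y_pred = knn.predict(X_test_scaled)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")

# Assuming new_input is a list of values corresponding to columns_list
# (replace it with actual data); scale it with the fitted scaler before predicting
new_input_scaled = scaler.transform(pd.DataFrame([new_input], columns=columns_list))
predicted_class = knn.predict(new_input_scaled)

print(f"The predicted class for the new input is: {predicted_class[0]}")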

In [None]:
# Regression of the data
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler

# Feature scaling for better performance of KNN
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Creating the KNN model for regression
knn_regressor = KNeighborsRegressor(n_neighbors=5)  # You can adjust the number of neighbors

# Fitting the model with the training data
knn_regressor.fit(X_train_scaled, y_train)

# Assuming new_input is a new data point you want to predict
# new_input should be a list of values corresponding to columns_list
new_input_scaled = scaler.transform([new_input])  # Replace new_input with actual data
predicted_target = knn_regressor.predict(new_input_scaled)

print(predicted_target)

### Linear Regression with Least Squares

__Linear regression__ is a type of _regression analysis_ used for predicting the value of a _continuous dependent variable_. It works by finding the _line that best fits the data_.

_Least squares_ is a method for finding the _best-fitting_ line by __minimizing__ the _sum of the squared differences_ between the predicted and actual values.
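
As a small worked case, the least squares fit can be computed directly with NumPy. The data below is hypothetical and generated exactly from $y = 2x + 1$, so the recovered slope and intercept should match those values up to floating-point error.

```python
import numpy as np

# Hypothetical data generated exactly from y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * x + 1.0

# Design matrix with a column of ones for the intercept
A = np.column_stack([x, np.ones_like(x)])

# lstsq finds the coefficients that minimise ||A @ coef - y||^2
coef, residuals, rank, singular_values = np.linalg.lstsq(A, y, rcond=None)
slope, intercept = coef
print(slope, intercept)
```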

In [None]:
from sklearn.linear_model import LinearRegression

# Creating the Linear Regression model
linear_regressor = LinearRegression()

# Fitting the model with the training data
linear_regressor.fit(X_train, y_train)

# Assuming new_input is a new data point you want to predict
# new_input should be a list of values corresponding to columns_list
new_input = [value1, value2, value3]  # Replace value1, value2, value3 with actual values

# Predicting the target for the new input
predicted_target = linear_regressor.predict([new_input])

print(f"The predicted target for the new input is: {predicted_target[0]}")

### Regularization with Ridge and Lasso

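__Ridge regression__ (_L2_) and __Lasso regression__ (_L1_) are types of _linear regression_ that include a _penalty_ term to __reduce overfitting__. They work by adding a _regularization term_ to the least squares objective function: Ridge penalizes the sum of squared coefficients, while Lasso penalizes the sum of their absolute values and can shrink some coefficients exactly to zero.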
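
To make the penalty concrete, the sketch below compares the closed-form ridge solution $w = (X^\top X + \alpha I)^{-1} X^\top y$ with scikit-learn's `Ridge`. The synthetic data and the names `X_demo`, `y_demo`, and `w_closed` are assumptions for illustration, and the intercept is switched off so that the two solutions are directly comparable.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Hypothetical synthetic data: 50 samples, 3 features, small noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)

alpha = 1.0

# Closed form: solve (X^T X + alpha * I) w = X^T y
w_closed = np.linalg.solve(X_demo.T @ X_demo + alpha * np.eye(3), X_demo.T @ y_demo)

# Ridge minimises ||y - Xw||^2 + alpha * ||w||^2; no intercept, to match the formula
w_sklearn = Ridge(alpha=alpha, fit_intercept=False).fit(X_demo, y_demo).coef_

print(w_closed)
print(w_sklearn)
```

Raising `alpha` pulls all coefficients toward zero, which is the regularization effect described above.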

In [None]:
# Implementing Ridge Regression (L2 regularization)
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

# Creating the Ridge Regression model
# alpha is the regularization strength; larger values specify stronger regularization.
ridge_regressor = Ridge(alpha=1.0)

# Fitting the model with the training data
ridge_regressor.fit(X_train, y_train)

# Making predictions on the test set
y_pred = ridge_regressor.predict(X_test)

# Calculating the mean squared error of the predictions
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse}")

In [None]:
# Implementing Lasso Regression (L1 regularization)
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error

# Creating the Lasso Regression model
# alpha is the regularization strength; larger values specify stronger regularization.
lasso_regressor = Lasso(alpha=1.0)

# Fitting the model with the training data
lasso_regressor.fit(X_train, y_train)

# Making predictions on the test set
y_pred = lasso_regressor.predict(X_test)

# Calculating the mean squared error of the predictions
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse}")

# To understand feature sensitivity, you can look at the coefficients
coefficients = pd.DataFrame(lasso_regressor.coef_, X.columns, columns=['Coefficient'])
print(coefficients)

### Polynomial Regression

__Polynomial regression__ is a type of _regression analysis_ that models
the _relationship_ between the independent and dependent variables as
an $n$th-degree _polynomial_. It can capture _non-linear relationships_ between the variables.

In [None]:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Assuming football_df is your DataFrame and it has a 'target' column
# columns_list is a list of column names to be used as features
columns_list = ['feature1', 'feature2', 'feature3']  # Replace with actual column names

# Splitting the data into features and target
X = football_df[columns_list]
y = football_df['target']

# Splitting the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Transforming the features into polynomial features
degree = 2  # Degree of the polynomial features
poly_features = PolynomialFeatures(degree=degree)
X_train_poly = poly_features.fit_transform(X_train)
X_test_poly = poly_features.transform(X_test)

# Creating the Linear Regression model
linear_regressor = LinearRegression()

# Fitting the model with the polynomial features and the training data
linear_regressor.fit(X_train_poly, y_train)

# Making predictions on the test set
y_pred = linear_regressor.predict(X_test_poly)

# Calculating the mean squared error of the predictions
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse}")

### Logistic Regression

__Logistic regression__ is a type of _regression analysis_ used for predicting the outcome of a _categorical dependent variable_.
It is most commonly used for __binary classification__ tasks, where the model outputs a
probability between $0$ and $1$ that is then thresholded to assign a class.
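
The probability comes from passing a linear combination of the features through the sigmoid function. The sketch below uses made-up weights (`w`, `b`) and a made-up input purely to show the computation; in a fitted model these values are learned from the data.

```python
import numpy as np

def sigmoid(z):
    """Map any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights, intercept, and input (not learned from real data)
w = np.array([0.8, -0.4])
b = 0.1
x_new = np.array([1.0, 2.0])

z = w @ x_new + b           # linear score: 0.8 - 0.8 + 0.1 = 0.1
p = sigmoid(z)              # probability of the positive class
label = int(p >= 0.5)       # threshold at 0.5 to get a class
print(p, label)
```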

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Creating the Logistic Regression model
logistic_regressor = LogisticRegression()

# Fitting the model with the training data
logistic_regressor.fit(X_train, y_train)

# Making predictions on the test set
y_pred = logistic_regressor.predict(X_test)

# Calculating the accuracy of the predictions
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")

### Cross-Validation

__Cross-validation__ is a technique for _assessing the performance_ of a
model. It involves _splitting_ the data into multiple subsets, training the model on some subsets, and evaluating it on others.

__Cross-validation__ helps to _detect overfitting_ and provides a more
reliable estimate of the model's performance than a single train/test split.
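
What `cross_val_score` automates can be written out as an explicit loop. The sketch below uses a synthetic dataset from `make_classification` as a stand-in for real data, and the names `X_cv`, `y_cv`, and `fold_scores` are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

# Hypothetical synthetic dataset standing in for real data
X_cv, y_cv = make_classification(n_samples=100, n_features=4, random_state=0)

kfold = KFold(n_splits=5, shuffle=True, random_state=0)
fold_scores = []
for train_idx, test_idx in kfold.split(X_cv):
    # Train a fresh model on four folds, evaluate on the held-out fold
    model_cv = LogisticRegression()
    model_cv.fit(X_cv[train_idx], y_cv[train_idx])
    fold_scores.append(model_cv.score(X_cv[test_idx], y_cv[test_idx]))

print("Accuracy per fold:", fold_scores)
print("Mean accuracy:", np.mean(fold_scores))
```

Every sample is used for evaluation exactly once, which is why the averaged score is less sensitive to one lucky or unlucky split.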

In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
import pandas as pd

# Create a logistic regression model
model = LogisticRegression()

# Perform cross-validation
scores = cross_val_score(model, X, y, cv=5)  # cv=5 for 5-fold cross-validation

# Print the accuracy for each fold
print("Accuracy for each fold: ", scores)

# Print the mean accuracy of all 5 folds
print("Mean cross-validation accuracy: ", scores.mean())

### Encoding

__One-hot encoding__ is a technique for _converting_ _categorical_ variables into _numerical_ variables.

It creates a _binary vector_ for each _category_, with a $1$ for the
category and $0$s for all other categories.
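
The binary vectors can be built by hand to see what the encoder does. This is a minimal NumPy sketch on a made-up `colors` array, not a replacement for `OneHotEncoder`.

```python
import numpy as np

# Hypothetical categorical column
colors = np.array(["Red", "Blue", "Green", "Blue"])

# Sorted unique categories and the integer code of each entry
categories, codes = np.unique(colors, return_inverse=True)

# Row i of the identity matrix is the one-hot vector for category i
one_hot = np.eye(len(categories), dtype=int)[codes]

print(categories)   # column order: Blue, Green, Red
print(one_hot)
```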

In [None]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Creating a fake DataFrame
data = {
    'Color': ['Red', 'Blue', 'Green', 'Blue', 'Red'],
    'Size': ['S', 'M', 'L', 'XL', 'S'],
    'Price': [10, 15, 20, 25, 10]
}
df = pd.DataFrame(data)

# Display the original DataFrame
print("Original DataFrame:")
print(df)

# Applying OneHotEncoder
encoder = OneHotEncoder()
encoded_data = encoder.fit_transform(df[['Color', 'Size']]) # Encoding 'Color' and 'Size' columns
encoded_data_dense = encoded_data.toarray()

print(encoded_data_dense)
print(encoder.get_feature_names_out(['Color', 'Size']))

# Creating a DataFrame with the encoded data
encoded_df = pd.DataFrame(encoded_data_dense, columns=encoder.get_feature_names_out(['Color', 'Size']))

# Concatenating the encoded columns with the original DataFrame (excluding the original 'Color' and 'Size' columns)
final_df = pd.concat([df.drop(['Color', 'Size'], axis=1), encoded_df], axis=1)

# Display the final DataFrame after one-hot encoding
print("\nDataFrame after OneHotEncoding:")
print(final_df)

# Supervised Machine Learning Algorithms

### Random Forest

__Random forest__ is an _ensemble learning_ method that combines
_multiple decision trees_ to create a strong predictive model.

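It works by training _multiple trees_ on random subsets of the data and features,
then combining their predictions (a vote for classification, an average for
regression) to _reduce overfitting_.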
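
The idea can be sketched with plain decision trees, bootstrap samples, and a majority vote. This is a simplified illustration on synthetic binary data, with assumed names such as `X_rf` and `forest_pred`; scikit-learn's `RandomForestClassifier` combines trees by averaging their predicted class probabilities rather than counting hard votes.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Hypothetical synthetic binary classification data (labels are 0 and 1)
X_rf, y_rf = make_classification(n_samples=200, n_features=6, random_state=0)
rng = np.random.default_rng(0)

trees = []
for _ in range(25):
    # Bootstrap sample: draw rows with replacement
    idx = rng.integers(0, len(X_rf), size=len(X_rf))
    # max_features="sqrt" makes each split consider a random subset of features
    tree = DecisionTreeClassifier(max_features="sqrt",
                                  random_state=int(rng.integers(1_000_000)))
    trees.append(tree.fit(X_rf[idx], y_rf[idx]))

# Stack per-tree predictions: shape (n_trees, n_samples)
all_votes = np.stack([t.predict(X_rf) for t in trees])
# Majority vote for 0/1 labels: predict 1 when at least half the trees say 1
forest_pred = (all_votes.mean(axis=0) >= 0.5).astype(int)

train_accuracy = (forest_pred == y_rf).mean()
print(f"Training accuracy of the hand-built forest: {train_accuracy}")
```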

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Load the dataset
df = pd.read_csv('student-por.csv')

# Preprocess the data
# The target variable is 'G3', treated here as a categorical label;
# string-valued feature columns are converted to integer codes
le = LabelEncoder()

print(df['G3'].value_counts())
df['school'] = le.fit_transform(df['school'])
df['sex'] = le.fit_transform(df['sex'])
df['address'] = le.fit_transform(df['address'])
df['famsize'] = le.fit_transform(df['famsize'])
df['Pstatus'] = le.fit_transform(df['Pstatus'])
df['Mjob'] = le.fit_transform(df['Mjob'])
df['Fjob'] = le.fit_transform(df['Fjob'])
df['reason'] = le.fit_transform(df['reason'])
df['guardian'] = le.fit_transform(df['guardian'])

X = df.drop(['G3', 'schoolsup', 'famsup', 'famrel', 'paid', 'romantic', 'activities', 'higher', 'internet', 'nursery'], axis=1)  # Features
y = df['G3']  # Target

# Normalize features
#scaler = StandardScaler()
#X = scaler.fit_transform(X)


#print(y.value_counts())  # Display the distribution of the target variable

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train the model
clf = RandomForestClassifier(random_state=42)
clf.fit(X_train, y_train)

# Make predictions
y_pred = clf.predict(X_test)

# Calculate metrics
accuracy = accuracy_score(y_test, y_pred)

print(f"Accuracy: {accuracy}")


### Gradient Boosted Decision Trees

__Gradient boosted decision trees__ are an _ensemble learning_ method
that combines _multiple decision trees_ and _gradient descent
optimization_ to create a strong predictive model.

They work by building _trees sequentially_, with each tree _correcting the
errors_ of the previous trees.
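
For squared-error loss, "correcting the errors" means fitting each new tree to the current residuals. The loop below is a bare-bones sketch of that idea on synthetic data; the learning rate, tree depth, and number of rounds are arbitrary choices, and libraries such as XGBoost add regularization and many optimizations on top.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical noisy sine-wave data
rng = np.random.default_rng(0)
X_gb = rng.uniform(-3, 3, size=(200, 1))
y_gb = np.sin(X_gb[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1

# Start from a constant prediction (the mean of the targets)
prediction = np.full_like(y_gb, y_gb.mean())
initial_mse = np.mean((y_gb - prediction) ** 2)

for _ in range(50):
    # Residuals are the negative gradient of the squared-error loss
    residuals = y_gb - prediction
    # Fit a shallow tree to the residuals and add a damped correction
    small_tree = DecisionTreeRegressor(max_depth=2).fit(X_gb, residuals)
    prediction += learning_rate * small_tree.predict(X_gb)

final_mse = np.mean((y_gb - prediction) ** 2)
print(f"Training MSE before boosting: {initial_mse:.4f}, after: {final_mse:.4f}")
```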

In [None]:
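# !pip install xgboost

import xgboost as xgb
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder

# XGBoost expects class labels encoded as consecutive integers 0..n_classes-1
label_encoder = LabelEncoder()
y_train_encoded = label_encoder.fit_transform(y_train)

model = xgb.XGBClassifier(eval_metric='mlogloss', objective='multi:softmax')
model.fit(X_train, y_train_encoded)

# Make predictions and map them back to the original labels
y_pred = label_encoder.inverse_transform(model.predict(X_test).astype(int))

# Calculate metrics
accuracy = accuracy_score(y_test, y_pred)

print(f"Accuracy: {accuracy}")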


In [None]:
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.preprocessing import LabelEncoder

# Load the dataset
df = pd.read_csv('datasets_kaggle/insurance.csv')

# Preprocess the data
# Assuming 'charges' is the target variable
X = df.drop('charges', axis=1)
y = df['charges']

# Encode categorical variables
for column in X.select_dtypes(include=['object']).columns:
    X[column] = LabelEncoder().fit_transform(X[column])

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Build the XGBoost model
model = xgb.XGBRegressor(objective ='reg:squarederror', colsample_bytree = 0.3, learning_rate = 1.0,
                max_depth = 20, alpha = 10, n_estimators = 20)

# Train the model
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)

print(f"MSE: {mse}")

### Neural Networks

__Neural networks__ are a type of _machine learning_ model inspired by
the _human brain_.

They consist of _layers of interconnected nodes_ that process input data
and produce output data.
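
A single forward pass through a small network can be written in a few lines of NumPy. The weights below are random and untrained, and the layer sizes are arbitrary; the sketch only shows how data flows through the layers, not how the network learns.

```python
import numpy as np

def relu(z):
    """Rectified linear unit: keep positive values, zero out the rest."""
    return np.maximum(0.0, z)

def softmax(z):
    """Turn each row of scores into probabilities that sum to 1."""
    exp_z = np.exp(z - z.max(axis=1, keepdims=True))  # shift for numerical stability
    return exp_z / exp_z.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)

# Hypothetical network: 4 inputs -> 8 hidden units -> 3 output classes
W1, b1 = rng.normal(size=(4, 8)), np.zeros(8)
W2, b2 = rng.normal(size=(8, 3)), np.zeros(3)

X_batch = rng.normal(size=(5, 4))       # batch of 5 made-up samples

hidden = relu(X_batch @ W1 + b1)        # hidden layer activations
probs = softmax(hidden @ W2 + b2)       # class probabilities per sample
print(probs)
```

Training consists of adjusting `W1`, `b1`, `W2`, and `b2` to reduce a loss on labeled data, which is the part frameworks such as Keras handle automatically.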

In [None]:
# !pip install tensorflow
# !pip install keras

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.utils import to_categorical

# Load the dataset
df = pd.read_csv('datasets_kaggle/Student_performance_data _.csv')

# Preprocess the data
# Assuming 'GradeClass' is the target variable and it's categorical
X = df.drop('GradeClass', axis=1).values
y = LabelEncoder().fit_transform(df['GradeClass'].values)
y = to_categorical(y)  # Convert labels to one-hot encoding

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Normalize features, fitting the scaler on the training data only
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Build the neural network
model = Sequential()
model.add(Dense(64, input_shape=(X_train.shape[1],), activation='relu'))
model.add(Dense(64, activation='relu'))
model.add(Dense(32, activation='relu'))
model.add(Dense(32, activation='relu'))
model.add(Dense(16, activation='relu'))
model.add(Dense(y_train.shape[1], activation='softmax'))  # Output layer

# Compile the model
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Train the model
model.fit(X_train, y_train, epochs=100, batch_size=10, verbose=0)

# Evaluate the model
loss, accuracy = model.evaluate(X_test, y_test, verbose=0)

print(f"Accuracy: {accuracy}")

# Model Evaluation

### Confusion Matrices

A __confusion matrix__ is a table that _summarizes the performance_ of a
classification model.

It shows the number of _true positives_, _true negatives_, _false positives_,
and _false negatives_.
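
A tiny binary example makes the four counts easy to check by hand. The label lists below are invented for illustration.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and predictions for a binary problem
y_true_toy = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred_toy = [1, 0, 0, 1, 0, 1, 1, 0]

# Rows are actual classes, columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
cm_toy = confusion_matrix(y_true_toy, y_pred_toy)
tn, fp, fn, tp = cm_toy.ravel()

print(cm_toy)
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
```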

In [None]:
from sklearn.metrics import confusion_matrix

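# Calculate the confusion matrix
# (y_test and y_pred are assumed to be class labels from a classifier, e.g. the Random Forest above)
cm = confusion_matrix(y_test, y_pred)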

print("Confusion Matrix:")
print(cm)

### Basic Metrics

- __Accuracy:__ The proportion of correct predictions.
- __Precision:__ The proportion of true positives among all positive
predictions.
- __Recall:__ The proportion of true positives among all actual positives.
- __F1 Score:__ The harmonic mean of precision and recall.

In [None]:
# Assuming cm is a 2x2 confusion matrix from a binary classifier
# (ravel() only unpacks into four values in the binary case)
TN, FP, FN, TP = cm.ravel()

# Calculate metrics
accuracy = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)
recall = TP / (TP + FN)
# Named f1 rather than f1_score to avoid shadowing sklearn's f1_score function
f1 = 2 * (precision * recall) / (precision + recall)

# Print metrics
print(f"Accuracy: {accuracy}")
print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"F1 Score: {f1}")

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Assuming y_true and y_pred are defined
# y_true = [...]  # True labels
# y_pred = [...]  # Predictions made by the Random Forest model

# Calculate metrics
accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred, average='binary')
recall = recall_score(y_true, y_pred, average='binary')
f1 = f1_score(y_true, y_pred, average='binary')

# Print metrics
print(f"Accuracy: {accuracy}")
print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"F1 Score: {f1}")

### Classifier Decision Metrics

- __ROC Curve:__ A plot of the true positive rate against the false positive rate as the decision threshold varies.
- __Precision-Recall Curve:__ A plot of precision against recall as the decision threshold varies.
- __AUC-ROC:__ The area under the ROC curve.
- __AUC-PR:__ The area under the precision-recall curve.
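
A four-sample example shows how these quantities are computed from predicted scores. The labels and scores are made up; three of the four positive/negative pairs are ranked correctly, so the AUC-ROC is 0.75.

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical true labels and predicted scores for the positive class
y_true_toy = [0, 0, 1, 1]
scores_toy = [0.1, 0.4, 0.35, 0.8]

# Each threshold on the score gives one (FPR, TPR) point on the ROC curve
fpr_toy, tpr_toy, thresholds_toy = roc_curve(y_true_toy, scores_toy)
auc_toy = roc_auc_score(y_true_toy, scores_toy)

print(fpr_toy, tpr_toy)
print(auc_toy)  # 0.75
```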

In [None]:
from sklearn.metrics import roc_curve, precision_recall_curve, auc
import matplotlib.pyplot as plt

# Assuming y_true and y_pred_proba are defined
# y_true = [...]  # True labels
# y_pred_proba = [...]  # Predicted probabilities for the positive class

# Calculate ROC Curve
fpr, tpr, thresholds_roc = roc_curve(y_true, y_pred_proba)
roc_auc = auc(fpr, tpr)

# Plot ROC Curve
plt.figure()
plt.plot(fpr, tpr, color='darkorange', lw=2, label=f'ROC curve (area = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic')
plt.legend(loc="lower right")
plt.show()

# Calculate Precision-Recall Curve
precision, recall, thresholds_pr = precision_recall_curve(y_true, y_pred_proba)
pr_auc = auc(recall, precision)

# Plot Precision-Recall Curve
plt.figure()
plt.plot(recall, precision, color='blue', lw=2, label=f'Precision-Recall curve (area = {pr_auc:.2f})')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curve')
plt.legend(loc="lower left")
plt.show()

### Regression Evaluation Metrics

- __Mean Squared Error:__ The average of the squared differences between the predicted and actual values.
- __Mean Absolute Error:__ The average of the absolute differences between the predicted and actual values.
- __R-Squared:__ The proportion of the variance in the dependent variable that is predictable from the independent variables.
- __Adjusted R-Squared:__ A modified version of R-squared that adjusts for the number of predictors in the model.
- __Root Mean Squared Error:__ The square root of the mean squared error.
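
The metrics above can be checked on a small self-contained example. The values are invented, and the single predictor (`n_predictors = 1`) is an assumption needed only for the adjusted R-squared formula.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical true values and predictions
y_true_toy = np.array([3.0, -0.5, 2.0, 7.0])
y_pred_toy = np.array([2.5, 0.0, 2.0, 8.0])

n_obs = len(y_true_toy)
n_predictors = 1  # assumed number of features in the model

mse_toy = mean_squared_error(y_true_toy, y_pred_toy)    # (0.25 + 0.25 + 0 + 1) / 4
mae_toy = mean_absolute_error(y_true_toy, y_pred_toy)   # (0.5 + 0.5 + 0 + 1) / 4
r2_toy = r2_score(y_true_toy, y_pred_toy)
adj_r2_toy = 1 - (1 - r2_toy) * (n_obs - 1) / (n_obs - n_predictors - 1)
rmse_toy = np.sqrt(mse_toy)

print(mse_toy, mae_toy, r2_toy, adj_r2_toy, rmse_toy)
```

Adjusted R-squared is never larger than R-squared, and the gap widens as predictors are added without a matching gain in fit.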

In [None]:
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import numpy as np

# Assuming y_true and y_pred are defined
# y_true = [...]  # True values
# y_pred = [...]  # Predictions made by the XGBoost regressor

# Number of observations and predictors
n = len(y_true)  # Number of observations
p = model.n_features_in_  # Number of predictors; model is assumed to be the fitted XGBoost regressor

# Calculate metrics
mse = mean_squared_error(y_true, y_pred)
mae = mean_absolute_error(y_true, y_pred)
r_squared = r2_score(y_true, y_pred)
adjusted_r_squared = 1 - (1-r_squared) * (n-1) / (n-p-1)
rmse = np.sqrt(mse)

# Print metrics
print(f"MSE: {mse}")
print(f"MAE: {mae}")
print(f"R-squared: {r_squared}")
print(f"Adjusted R-squared: {adjusted_r_squared}")
print(f"RMSE: {rmse}")