## 1. Introduction to Machine Learning with Python

- Supervised Learning
- Unsupervised Learning

### 1.1 Supervised Learning
- Classification
- Regression

#### 1.1.1 Classification
- K-Nearest Neighbors
- Logistic Regression

#### 1.1.2 Regression
- Linear Regression
- Ridge Regression
- Lasso Regression

### 1.2 Unsupervised Learning
- Clustering

#### 1.2.1 Clustering
- K-Means
- Hierarchical Clustering

### Data Preprocessing
- Data Encoding
- Data Imputation
- Data Normalization
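
The three preprocessing steps listed above can be sketched on a tiny, made-up DataFrame. The column names `mass` and `color` and their values are assumptions chosen for illustration, not data from this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

# Toy data (hypothetical): one numeric and one categorical column, each with a gap
toy = pd.DataFrame({
    "mass": [1.0, 2.0, np.nan, 3.0],
    "color": pd.Series(["red", "blue", np.nan, "red"], dtype=object),
})

# Imputation: fill the numeric gap with the mean, the categorical gap with the mode
mass_filled = SimpleImputer(strategy="mean").fit_transform(toy[["mass"]])
color_filled = SimpleImputer(strategy="most_frequent").fit_transform(toy[["color"]])

# Encoding: map each category to an integer code (categories are sorted, so blue=0, red=1)
color_codes = OrdinalEncoder().fit_transform(color_filled)

# Normalization: rescale the numeric column to zero mean and unit variance
mass_scaled = StandardScaler().fit_transform(mass_filled)

print(mass_filled.ravel(), color_filled.ravel(), color_codes.ravel())
```

Imputation runs first here because a plain `StandardScaler` or `OrdinalEncoder` call is easier to reason about once the gaps are filled.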


**In supervised machine learning, the main aim is to predict an output from the input data; unsupervised learning instead looks for structure in unlabeled data.**

## Preprocessor function

In [16]:
# import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler, OrdinalEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# import stats
from scipy import stats

In [17]:
# Preprocess a DataFrame: drop sparse columns, impute, scale, encode, and remove outliers

def preprocess_data(data):
    # compute the percentage of missing values in each column
    missing_values = data.isnull().mean() * 100
    # drop columns with more than 25% missing values
    data = data.drop(columns=missing_values[missing_values > 25].index)
    # impute the other missing values
    # categorical columns
    cat_cols = data.select_dtypes(include='object').columns
    # numerical columns
    num_cols = data.select_dtypes(exclude='object').columns
    
    # create a pipeline for numerical columns
    numerical_pipeline = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='mean')),
        ('scaler', StandardScaler())
    ])

    # create a pipeline for categorical columns
    cat_pipeline = Pipeline([
        ('imputer', SimpleImputer(strategy = 'most_frequent')),
        ('encoder', OrdinalEncoder())
    ])

    # create a column transformer (ColumnTransformer is imported at the top of the notebook)
    preprocessor = ColumnTransformer([
        ('num', numerical_pipeline, num_cols),
        ('cat', cat_pipeline, cat_cols)
    ])
    
    # fit and transform the data
    data = preprocessor.fit_transform(data)
    data = pd.DataFrame(data, columns = num_cols.tolist() + cat_cols.tolist())

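    # remove outliers from the data
    # absolute z-scores of the numerical columns only; the ordinal codes in
    # the categorical columns are labels, not magnitudes
    zscore = np.abs(stats.zscore(data[num_cols]))
    # keep rows where every numerical value lies within 3 standard deviations
    data = data[(zscore < 3).all(axis=1)]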

    return data
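
The z-score filter at the end of `preprocess_data` can be illustrated in isolation. This is a minimal sketch on synthetic data; the column names `x` and `y`, the random seed, and the planted value `15.0` are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic data (hypothetical): 200 rows of standard-normal noise
rng = np.random.default_rng(0)
frame = pd.DataFrame({
    "x": rng.normal(size=200),
    "y": rng.normal(size=200),
})
# Plant one obvious outlier in the first row
frame.loc[0, "x"] = 15.0

# Absolute z-score of every cell, computed column by column
z = np.abs(stats.zscore(frame.to_numpy()))

# Keep only the rows where every column is within 3 standard deviations
filtered = frame[(z < 3).all(axis=1)]

print(len(frame), len(filtered))
```

Note that the filtered frame keeps its original index labels, so a `reset_index(drop=True)` may be wanted before positional indexing.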

In [18]:
# load the penguins dataset from seaborn
data = sns.load_dataset('penguins')
data.head()

|   | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|---|---|---|---|---|---|---|---|
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | Male |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | Female |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | Female |
| 3 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN |
| 4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | Female |


In [19]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 344 entries, 0 to 343
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   species            344 non-null    object 
 1   island             344 non-null    object 
 2   bill_length_mm     342 non-null    float64
 3   bill_depth_mm      342 non-null    float64
 4   flipper_length_mm  342 non-null    float64
 5   body_mass_g        342 non-null    float64
 6   sex                333 non-null    object 
dtypes: float64(4), object(3)
memory usage: 18.9+ KB
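
In the output above, `sex` has 333 non-null values out of 344, which is about 3.2% missing, so no penguins column crosses the 25% threshold used in `preprocess_data`. The sketch below shows the same check on a small hypothetical frame where one column does cross it; the column `mostly_empty` and all values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: "sex" is 25% missing, "mostly_empty" is 75% missing
frame = pd.DataFrame({
    "bill_length_mm": [39.1, 39.5, 40.3, 36.7],
    "sex": pd.Series(["Male", "Female", np.nan, "Female"], dtype=object),
    "mostly_empty": [np.nan, np.nan, np.nan, 1.0],
})

# Percentage of missing values per column
missing_pct = frame.isnull().mean() * 100

# Columns strictly above the 25% threshold are dropped; exactly 25% is kept
to_drop = missing_pct[missing_pct > 25].index.tolist()
kept = frame.drop(columns=to_drop)

print(missing_pct)
print(kept.columns.tolist())
```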


In [20]:
df = preprocess_data(data)
df.head()

|   | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | species | island | sex |
|---|---|---|---|---|---|---|---|
| 0 | -0.8870812 | 0.7877425 | -1.422488 | -0.565789 | 0.0 | 2.0 | 1.0 |
| 1 | -0.813494 | 0.1265563 | -1.065352 | -0.503168 | 0.0 | 2.0 | 0.0 |
| 2 | -0.6663195 | 0.4317192 | -0.422507 | -1.192003 | 0.0 | 2.0 | 0.0 |
| 3 | -1.307172e-15 | 1.806927e-15 | 0.0 | 0.0 | 0.0 | 2.0 | 1.0 |
| 4 | -1.328605 | 1.092905 | -0.565361 | -0.941517 | 0.0 | 2.0 | 0.0 |


In [21]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 344 entries, 0 to 343
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   bill_length_mm     344 non-null    float64
 1   bill_depth_mm      344 non-null    float64
 2   flipper_length_mm  344 non-null    float64
 3   body_mass_g        344 non-null    float64
 4   species            344 non-null    float64
 5   island             344 non-null    float64
 6   sex                344 non-null    float64
dtypes: float64(7)
memory usage: 18.9 KB
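
After preprocessing, the categorical columns are floats, so the original labels are no longer visible. A fitted `OrdinalEncoder` can map the codes back. The sketch below assumes you keep a reference to the encoder, which `preprocess_data` as written does not return, and uses a small hand-written `species` column rather than the full dataset.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Small stand-in for the species column (values chosen for illustration)
species = pd.DataFrame({"species": ["Adelie", "Gentoo", "Chinstrap", "Adelie"]})

encoder = OrdinalEncoder()
codes = encoder.fit_transform(species)

# categories_ holds one sorted array per input column; the position is the code
mapping = dict(enumerate(encoder.categories_[0]))

# inverse_transform recovers the original labels from the codes
decoded = encoder.inverse_transform(codes)

print(mapping)
print(decoded.ravel())
```

Because the categories are sorted, `Adelie` receives code 0, which matches the `0.0` shown for the first rows of the `species` column above.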


In [28]:
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.neighbors import KNeighborsRegressor
from sklearn.cluster import KMeans, AgglomerativeClustering

# candidate models, keyed by the name shown to the user
model_list = {
    'Linear Regression': LinearRegression(),
    'Ridge Regression': Ridge(),
    'Lasso Regression': Lasso(),
    'K Nearest Neighbors': KNeighborsRegressor(),
    'Kmeans': KMeans(),
    'Hierarchical Clustering': AgglomerativeClustering()
}

# let the user choose the model to train
def choose_model():
    print('Choose a model to train')
    for i, model in enumerate(model_list.keys()):
        print(f'{i+1}. {model}')
    choice = int(input('Enter the number of the model you want to train: '))
    model = list(model_list.values())[choice-1]
    return model

# let the user choose the target variable
def choose_target_variable(data):
    print('Choose the target variable')
    for i, col in enumerate(data.columns):
        print(f'{i+1}. {col}')
    choice = int(input('Enter the number of the target variable you want to predict: '))
    target = data.columns[choice-1]
    return target

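# train the model on every column except the chosen target
def train_model(model, target, data):
    X = data.drop(columns=[target])
    y = data[target]
    # clustering models ignore y; passing it keeps the call uniform
    model.fit(X, y)
    return model

# evaluate the model on the same data it was trained on
def evaluate_model(model, target, data):
    X = data.drop(columns=[target])
    y = data[target]
    # AgglomerativeClustering has no predict(); fall back to fit_predict()
    if hasattr(model, 'predict'):
        y_pred = model.predict(X)
    else:
        y_pred = model.fit_predict(X)
    return y.values, y_pred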

# test our functions
model = choose_model()
target = choose_target_variable(df)
model = train_model(model, target, df)
y, y_pred = evaluate_model(model, target, df)

Choose a model to train
1. Linear Regression
2. Ridge Regression
3. Lasso Regression
4. K Nearest Neighbors
5. Kmeans
6. Hierarchical Clustering
Choose the target variable
1. bill_length_mm
2. bill_depth_mm
3. flipper_length_mm
4. body_mass_g
5. species
6. island
7. sex
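
The `y` and `y_pred` arrays returned by `evaluate_model` still need a metric to be useful. For the regression models, one option is mean squared error and R² on a held-out split. The sketch below uses synthetic data and a fixed random seed, both of which are assumptions for illustration; it is not the evaluation used in this notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression problem (hypothetical): 3 features, small noise
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

# Hold out 25% of the rows so the scores are not computed on training data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"MSE: {mse:.4f}, R^2: {r2:.4f}")
```

Scores computed on the training rows, as in `evaluate_model` above, tend to be optimistic, which is why a held-out split is used here. These metrics do not apply to the clustering models, which have no target to compare against.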
