# Supervised Unsupervised Ensemble Machine Learning (SUEML)

## Documentation

### Complete Machine Learning Steps From ChatGPT 3.5 and the recently released, limited free GPT-4o (steps created 5/22/24)

**Problem Definition**: Clearly define the problem you want to solve. Understand the business context and the specific question you need to answer using ML.

**Data Collection**: Gather relevant data from various sources. Ensure that the data is representative of the problem you're addressing.

**Data Exploration**: Perform exploratory data analysis (EDA) to understand the data's structure, patterns, and relationships. Use visualizations and statistical methods.

**Data Cleaning**: Handle missing values, outliers, and inconsistencies. Ensure data quality by correcting errors and standardizing formats.

**Feature Engineering**: Create new features from existing data that can help the model learn better. This includes transformations, aggregations, and encoding categorical variables.

**Data Splitting**: Split the data into training and testing sets. Typically, an 80-20 split is used, but this can vary based on the dataset size and problem.

**Model Selection**: Choose appropriate algorithms for your problem (e.g., regression, classification, clustering). Consider simplicity, interpretability, and performance.

**Model Training**: Train the model on the training dataset. Optimize the model parameters to improve its performance.

**Model Evaluation**: Evaluate the model's performance using metrics appropriate for your problem (e.g., accuracy, precision, recall, F1 score for classification). Use the testing set for this purpose.

**Model Tuning**: Fine-tune hyperparameters using techniques like grid search or random search. Cross-validation helps ensure the model's robustness.

**Model Interpretation**: Understand and interpret the model's predictions. Use techniques like feature importance and SHAP values to explain the model.

**Model Deployment**: Deploy the model to a production environment where it can start making predictions on new data. Ensure it integrates well with existing systems.

**Monitoring and Maintenance**: Continuously monitor the model's performance over time. Retrain the model as necessary to handle new data and changing patterns.

**Documentation and Reporting**: Document the entire process, including data sources, preprocessing steps, model choices, and performance metrics. Prepare reports for stakeholders.
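
The splitting, training, tuning, and evaluation steps above can be sketched roughly as follows. This is a minimal illustration on synthetic data from `make_classification`, not the project's dataset; the choice of logistic regression, the parameter grid, and the F1 scoring are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (assumption: a binary classification problem)
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# Data splitting: 80-20, stratified so both sets keep the class balance
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Model training and tuning: grid search with 5-fold cross-validation
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.1, 1, 10]},  # illustrative grid
    cv=5,
    scoring="f1",
)
search.fit(X_train, y_train)

# Model evaluation: the test set is touched only once, at the end
y_pred = search.predict(X_test)
print("Best params:", search.best_params_)
print(classification_report(y_test, y_pred))
```

Keeping the test set out of the tuning loop is the design point here: cross-validation on the training set picks the hyperparameters, and the held-out 20% gives the final estimate.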



### Learn PCA
1. **Standardize Your Data**: PCA is sensitive to the scale of the data. Always standardize or normalize your features before applying PCA.
    ```python
    from sklearn.preprocessing import StandardScaler
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)
    ```

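2. **Choose the Right Number of Components**: Use the cumulative explained variance ratio to decide how many principal components to keep. A common rule of thumb is to retain 95-99% of the variance.
    ```python
    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.decomposition import PCA

    pca = PCA()  # keep all components to inspect the variance profile
    pca.fit(X_scaled)
    explained_variance = np.cumsum(pca.explained_variance_ratio_)
    # x-axis starts at 1 so it reads as the number of components kept
    plt.plot(range(1, len(explained_variance) + 1), explained_variance)
    plt.xlabel('Number of Components')
    plt.ylabel('Cumulative Explained Variance')
    plt.title('Explained Variance vs Number of Components')
    plt.show()
    ```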

3. **Interpreting PCA Components**: Principal components are linear combinations of the original features. They can be hard to interpret, but inspecting the loadings shows which features contribute most to each component. The snippet assumes `X` is a DataFrame, so that `X.columns` is available.
    ```python
    import pandas as pd
    pd.DataFrame(pca.components_, columns=X.columns, index=[f'PC{i+1}' for i in range(pca.n_components_)])
    ```

4. **Avoid Over-Reduction**: Reducing dimensions too aggressively can discard significant information. Balance dimensionality reduction against preserving variance.
    ```python
    from sklearn.decomposition import PCA
    pca = PCA(n_components=5)  # illustrative value; choose it from the explained-variance curve
    X_pca = pca.fit_transform(X_scaled)
    ```

5. **Model Compatibility**: PCA-transformed data can be used with most models, but ensure the assumptions of the models are still met with the transformed data. For example, linear models might benefit from PCA by handling multicollinearity, but tree-based models (like Random Forest) might not gain as much since they can handle multicollinearity naturally.

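6. **Cross-Validation**: Use cross-validation to check the robustness of the PCA and model combination. Note that `X_pca` below was fitted on all rows, so information from each validation fold leaks into the transform; cross-validating the full pipeline from step 7 avoids this.
    ```python
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    model = LogisticRegression(max_iter=1000)  # assumed estimator for illustration; substitute your own
    scores = cross_val_score(model, X_pca, y, cv=5)
    print("Cross-Validation Scores:", scores)
    ```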

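7. **Pipeline Integration**: Integrate PCA in a pipeline so that scaling and PCA are fitted on the training data only, which streamlines the process and avoids data leakage.
    ```python
    from sklearn.decomposition import PCA
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler

    # Hold out a test set before any fitting takes place
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    pipeline = Pipeline([
        ('scaler', StandardScaler()),
        ('pca', PCA(n_components=5)),
        ('model', RandomForestClassifier())
    ])

    pipeline.fit(X_train, y_train)
    y_pred = pipeline.predict(X_test)
    ```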

By following these steps and considerations, you can incorporate PCA into your machine learning workflow and, in many cases, simplify your models or improve their performance.
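
As a sketch of how steps 2, 6, and 7 fit together: scikit-learn's `PCA` accepts a float between 0 and 1 for `n_components` and then keeps just enough components to exceed that fraction of explained variance. The example below uses synthetic data and a logistic regression, both of which are assumptions for illustration rather than part of this project.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 20 features, only a few of them informative
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=5, random_state=0
)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    # Keep as many components as needed to retain 95% of the variance
    ("pca", PCA(n_components=0.95, svd_solver="full")),
    ("model", LogisticRegression(max_iter=1000)),
])

# Scaler and PCA are re-fitted inside every fold, so nothing leaks
scores = cross_val_score(pipe, X, y, cv=5)
print("Mean CV accuracy:", scores.mean())

# Fit once on all data to inspect how many components were kept
pipe.fit(X, y)
pca = pipe.named_steps["pca"]
print("Components kept:", pca.n_components_)
print("Variance retained:", np.sum(pca.explained_variance_ratio_))
```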

## Import libraries, set settings
Before working on a project, we need to know the problem statement
and what we are looking for. This template, however, is used primarily
to understand the code.

In [63]:
# import pandas, numpy, matplotlib, and seaborn libraries
#dataframe
import pandas as pd
#numerical and statistical
import numpy as np
#visualization
import matplotlib.pyplot as plt
#visualization extension
import seaborn as sns                   
# Suppress FutureWarning messages
import warnings             
warnings.simplefilter(action='ignore', category=FutureWarning) 
# Display floats as whole numbers without scientific notation (displayed values are rounded)
pd.set_option('display.float_format', lambda x: '%.0f' % x)

## EDA
Load Dataset

In [65]:
# load data and analyze data
df = pd.read_csv('example.csv')

In [66]:
# Display column names, dtypes, and non-null counts
# (df.info() prints its report and returns None, which is why `None` appears after the output)
display(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 17 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   checking_balance      1000 non-null   object
 1   months_loan_duration  1000 non-null   int64 
 2   credit_history        1000 non-null   object
 3   purpose               1000 non-null   object
 4   amount                1000 non-null   int64 
 5   savings_balance       1000 non-null   object
 6   employment_duration   1000 non-null   object
 7   percent_of_income     1000 non-null   int64 
 8   years_at_residence    1000 non-null   int64 
 9   age                   1000 non-null   int64 
 10  other_credit          1000 non-null   object
 11  housing               1000 non-null   object
 12  existing_loans_count  1000 non-null   int64 
 13  job                   1000 non-null   object
 14  dependents            1000 non-null   int64 
 15  phone                 1000 non-null   o

None

Check for null values, unique values, and duplicates

In [75]:
# Calculate the number and percentage of null values for each column
null_counts = df.isnull().sum()
null_percentages = null_counts / len(df) * 100

# Combine null counts, percentages, unique counts, and duplicate counts into a single DataFrame
null_info = pd.DataFrame({
    'Null #': null_counts,
    'Null %': null_percentages,
    'Unique #': df.nunique(),
    # number of values that repeat an earlier value in the same column
    'Duplicate #': df.apply(lambda x: x.duplicated().sum()),
})

# Display the combined information
display(null_info)

| | Null # | Null % | Unique # | Duplicate # |
|---|---|---|---|---|
| checking_balance | 0 | 0 | 4 | 996 |
| months_loan_duration | 0 | 0 | 33 | 967 |
| credit_history | 0 | 0 | 5 | 995 |
| purpose | 0 | 0 | 6 | 994 |
| amount | 0 | 0 | 921 | 79 |
| savings_balance | 0 | 0 | 5 | 995 |
| employment_duration | 0 | 0 | 5 | 995 |
| percent_of_income | 0 | 0 | 4 | 996 |
| years_at_residence | 0 | 0 | 4 | 996 |
| age | 0 | 0 | 53 | 947 |

See head, tail, sample

In [68]:
# head to see top five of the dataset
df.head(5)

| | checking_balance | months_loan_duration | credit_history | purpose | amount | savings_balance | employment_duration | percent_of_income | years_at_residence | age | other_credit | housing | existing_loans_count | job | dependents | phone | default |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | < 0 DM | 6 | critical | furniture/appliances | 1169 | unknown | > 7 years | 4 | 4 | 67 | none | own | 2 | skilled | 1 | yes | no |
| 1 | 1 - 200 DM | 48 | good | furniture/appliances | 5951 | < 100 DM | 1 - 4 years | 2 | 2 | 22 | none | own | 1 | skilled | 1 | no | yes |
| 2 | unknown | 12 | critical | education | 2096 | < 100 DM | 4 - 7 years | 2 | 3 | 49 | none | own | 1 | unskilled | 2 | no | no |
| 3 | < 0 DM | 42 | good | furniture/appliances | 7882 | < 100 DM | 4 - 7 years | 2 | 4 | 45 | none | other | 1 | skilled | 2 | no | no |
| 4 | < 0 DM | 24 | poor | car | 4870 | < 100 DM | 1 - 4 years | 3 | 4 | 53 | none | other | 2 | skilled | 2 | no | yes |


In [69]:
# tail to see last five of the dataset
df.tail(5)

| | checking_balance | months_loan_duration | credit_history | purpose | amount | savings_balance | employment_duration | percent_of_income | years_at_residence | age | other_credit | housing | existing_loans_count | job | dependents | phone | default |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 995 | unknown | 12 | good | furniture/appliances | 1736 | < 100 DM | 4 - 7 years | 3 | 4 | 31 | none | own | 1 | unskilled | 1 | no | no |
| 996 | < 0 DM | 30 | good | car | 3857 | < 100 DM | 1 - 4 years | 4 | 4 | 40 | none | own | 1 | management | 1 | yes | no |
| 997 | unknown | 12 | good | furniture/appliances | 804 | < 100 DM | > 7 years | 4 | 4 | 38 | none | own | 1 | skilled | 1 | no | no |
| 998 | < 0 DM | 45 | good | furniture/appliances | 1845 | < 100 DM | 1 - 4 years | 4 | 4 | 23 | none | other | 1 | skilled | 1 | yes | yes |
| 999 | 1 - 200 DM | 45 | critical | car | 4576 | 100 - 500 DM | unemployed | 3 | 4 | 27 | none | own | 1 | skilled | 1 | no | no |


In [70]:
# sample to see random 5 samples from dataset
df.sample(5)

| | checking_balance | months_loan_duration | credit_history | purpose | amount | savings_balance | employment_duration | percent_of_income | years_at_residence | age | other_credit | housing | existing_loans_count | job | dependents | phone | default |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 384 | unknown | 30 | poor | business | 4272 | 100 - 500 DM | 1 - 4 years | 2 | 2 | 26 | none | own | 2 | unskilled | 1 | no | no |
| 803 | unknown | 12 | critical | furniture/appliances | 976 | unknown | > 7 years | 4 | 4 | 35 | none | own | 2 | skilled | 1 | no | no |
| 66 | unknown | 12 | good | furniture/appliances | 2171 | < 100 DM | < 1 year | 2 | 2 | 29 | bank | own | 1 | skilled | 1 | no | no |
| 864 | unknown | 10 | good | furniture/appliances | 2210 | < 100 DM | 1 - 4 years | 2 | 2 | 25 | bank | rent | 1 | unskilled | 1 | no | yes |
| 164 | unknown | 36 | good | car | 909 | 500 - 1000 DM | > 7 years | 4 | 4 | 36 | none | own | 1 | skilled | 1 | no | no |


Display shape and describe

In [71]:
# shape of the DataFrame as a tuple: (rows, columns)
display(df.shape)

# descriptive statistics for the numerical columns
# transpose so each column becomes a row, which avoids overlap and is easier to read
display(df.describe().T)

# With 17 columns, we may want to drop some before training
# to reduce the risk of overfitting, as we will see later.
# 1000 samples is a fairly small dataset, which also makes overfitting more likely,
# but it is used here for simplicity.

(1000, 17)

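| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| months_loan_duration | 1000 | 21 | 12 | 4 | 12 | 18 | 24 | 72 |
| amount | 1000 | 3271 | 2823 | 250 | 1366 | 2320 | 3972 | 18424 |
| percent_of_income | 1000 | 3 | 1 | 1 | 2 | 3 | 4 | 4 |
| years_at_residence | 1000 | 3 | 1 | 1 | 2 | 3 | 4 | 4 |
| age | 1000 | 36 | 11 | 19 | 27 | 33 | 42 | 75 |
| existing_loans_count | 1000 | 1 | 1 | 1 | 1 | 1 | 2 | 4 |
| dependents | 1000 | 1 | 0 | 1 | 1 | 1 | 1 | 2 |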


In [72]:
# Why focus on the target? It gives a better idea of the relationship
# between the target and the predictors.
# General form: df.target.value_counts(). Here, suppose the target is age.
print(f'unique age: {df.age.nunique()}\n{df.age.value_counts()}')
# Age ranges from 19 to 75 (see min and max in the describe output above), i.e. adults only

unique age: 53
age
27    51
26    50
23    48
24    44
28    43
25    41
30    40
35    40
36    39
31    38
29    37
32    34
33    33
34    32
37    29
22    27
40    25
38    24
42    22
39    21
46    18
43    17
47    17
44    17
41    17
45    15
20    14
21    14
49    14
50    12
48    12
54    10
57     9
52     9
51     8
55     8
63     8
61     7
53     7
60     6
65     5
58     5
64     5
66     5
74     4
67     3
68     3
56     3
59     3
75     2
19     2
62     2
70     1
Name: count, dtype: int64


In [32]:
# In supervised learning there is a target (what we want to predict),
# and everything revolves around it, including the visualizations shown later.
# Renaming columns with df.rename() can help readability.
df.rename(columns={'age': 'aged'}, inplace=True)
df.aged

0      67
1      22
2      49
3      45
4      53
       ..
995    31
996    40
997    38
998    23
999    27
Name: aged, Length: 1000, dtype: int64

In [15]:
# isna()/isnull() flag null values cell by cell; sum() then counts the nulls in each column
df.isna().sum()

checking_balance        0
months_loan_duration    0
credit_history          0
purpose                 0
amount                  0
savings_balance         0
employment_duration     0
percent_of_income       0
years_at_residence      0
age                     0
other_credit            0
housing                 0
existing_loans_count    0
job                     0
dependents              0
phone                   0
default                 0
dtype: int64

### 1. Imputation (Filling Missing Values)

In [76]:
from sklearn.impute import SimpleImputer

# Impute missing values in the categorical 'checking_balance' column with the mode
# (assign the result back rather than calling fillna(inplace=True) on a column selection)
df['checking_balance'] = df['checking_balance'].fillna(df['checking_balance'].mode()[0])

# Impute numerical columns with the mean (example with `amount`)
imputer = SimpleImputer(strategy='mean')
df['amount'] = imputer.fit_transform(df[['amount']]).ravel()  # flatten the (n, 1) output to 1-D
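
The credit dataset used here has no missing values, so the imputation above leaves it unchanged. A minimal sketch on a made-up frame (the `toy` DataFrame and its values are hypothetical) shows what mode and mean imputation actually do:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical data with one missing value per column
toy = pd.DataFrame({
    "housing": ["own", "rent", np.nan, "own"],
    "amount": [1000.0, np.nan, 3000.0, 2000.0],
})

# Categorical column: fill with the most frequent value (mode)
toy["housing"] = toy["housing"].fillna(toy["housing"].mode()[0])

# Numerical column: fill with the column mean
imputer = SimpleImputer(strategy="mean")
toy["amount"] = imputer.fit_transform(toy[["amount"]]).ravel()

print(toy)
```

The missing `housing` entry becomes the most common category, and the missing `amount` becomes the mean of the observed amounts.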

In [11]:
# Replace col_name with the actual column name.
# Mode suits categorical columns; for numerical columns, the median is more robust to
# outliers than the mean. Once you understand the fundamentals, you can judge whether
# to use mode, mean, median, or to leave the rows out.
# df['col_name'] = df['col_name'].fillna(df['col_name'].mode()[0])  # impute values
# This dataset has no missing values, so no imputation is needed.
# Imputation is part of preprocessing, i.e. cleaning and preparing the data.

# Dropping and counting data
# df = df.drop_duplicates(subset=['A'])             # drop rows with duplicate values in column 'A'
# df = df.dropna()                                  # drop rows containing NaN values
# df.dropna(axis=1, inplace=True)                   # drop columns containing NaN (axis=0 for rows)
# df = df.drop(['col1_name', 'col2_name'], axis=1)  # drop the named columns
# df.count()                                        # non-null count per column

# SimpleImputer() to replace NaN (the mean strategy works on numerical columns only)
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='mean')
numeric_df = df.select_dtypes(include='number')
df_imputed = pd.DataFrame(imputer.fit_transform(numeric_df), columns=numeric_df.columns)  # fit & transform

# Combine DataFrames: pd.concat expects a list (df1 and df2 are placeholder names)
# combined_df = pd.concat([df1, df2], ignore_index=True)
# combined_df.to_csv('combined.csv', index=False)
# Places columns into categorical and numerical dictionary
# separate num and obj
cat_cols = df.select_dtypes(include=['object', 'category']).columns.tolist()
num_cols = df.select_dtypes(include=['number']).columns.tolist()
# dictionary of num and obj
columns_dict = {
    'Categorical Columns': cat_cols,
    'Numerical Columns': num_cols
}
# detect outliers
# Function to find outliers using the IQR method
def find_outliers(df):
    Q1 = df.quantile(0.25)
    Q3 = df.quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    lower_outliers = df < lower_bound
    upper_outliers = df > upper_bound
    combined_outliers = lower_outliers | upper_outliers
    return lower_outliers, upper_outliers, combined_outliers, lower_bound, upper_bound

# Find outliers (numerical columns only; quantiles cannot be computed on object columns)
lower_outliers, upper_outliers, combined_outliers, lower_bound, upper_bound = find_outliers(df[num_cols])

# Count number of outliers for each numerical column
lower_outlier_counts = lower_outliers.sum()
upper_outlier_counts = upper_outliers.sum()
combined_outlier_counts = combined_outliers.sum()

# Create a DataFrame to display all outlier counts and ranges together
outliers_summary = pd.DataFrame({
    'Lower Outliers': lower_outlier_counts,
    'Upper Outliers': upper_outlier_counts,
    'Combined Outliers': combined_outlier_counts,
    'Lower Bound': lower_bound,
    'Upper Bound': upper_bound
})

# Print the summary table
print(outliers_summary.T)
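
To make the IQR rule concrete, here is a small worked sketch on a made-up series (the values are hypothetical). A point is flagged when it lies more than 1.5 × IQR below Q1 or above Q3:

```python
import pandas as pd

# Hypothetical values with one obvious outlier
values = pd.Series([10, 12, 13, 14, 15, 100])

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the quartiles
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = values[(values < lower) | (values > upper)]
print(f"Q1={q1}, Q3={q3}, IQR={iqr}")
print(f"bounds: [{lower}, {upper}]")
print("outliers:", outliers.tolist())
```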
# Grouping data (sales and merge_df are example DataFrames that are not defined in this notebook)
sales.groupby('type')['sold'].agg(['max', 'sum'])
category_sentiment_counts = merge_df.groupby('Category')['Sentiment'].value_counts().unstack()
# Cross-tabulation between two categorical columns (frequency table of their combinations)
crosstab = pd.crosstab(df['cat1'], df['cat2'])

# Univariate analysis: histogram and boxplot for each numerical column
def plot_numerical_columns(df):
    # Select the numerical columns of the DataFrame that was passed in
    num_cols = df.select_dtypes(include=['number']).columns.tolist()

    # One figure per column: histogram on the left, boxplot on the right
    for col in num_cols:
        fig, axes = plt.subplots(1, 2, figsize=(16, 6))

        # Histogram with a kernel density estimate
        sns.histplot(data=df, x=col, kde=True, ax=axes[0])
        axes[0].set_title(f'Histogram of {col}')

        # Boxplot of the same column
        sns.boxplot(data=df, x=col, ax=axes[1])
        axes[1].set_title(f'Boxplot of {col}')

        plt.tight_layout()
        plt.show()

plot_numerical_columns(df)
# Bins, countplot
# Note: 'crop', 'Age', 'Education', and 'Personal_Loan' are not columns of the credit
# dataset loaded above; substitute columns from your own DataFrame.
sns.countplot(data=df, y='crop')  # countplot for the categorical 'crop' column

plt.figure(figsize=(8, 6))
# Convert 'Age' to categorical bins of width 5
df['Age_Bins'] = pd.cut(df['Age'], bins=range(0, 101, 5), right=False)

# Plot the barplot with the binned ages
sns.barplot(data=df, x='Age_Bins', y='Personal_Loan')
plt.title('Personal Loan and Age')
plt.xticks(rotation=90);

plt.figure(figsize=(8, 6))
# Map numeric education codes to readable labels
df['Education_Bins'] = df['Education'].map({1: 'Undergrad', 2: 'Graduate', 3: 'Advanced/Professional'})
sns.barplot(data=df, x='Education_Bins', y='Personal_Loan')



# Bivariate analysis (focus on visualizing the target if there is one)
sns.pairplot(data=df[num_cols])  # grid of pairwise scatterplots
# Correlation heatmap (numerical columns only)
plt.figure(figsize=(10, 8))
sns.heatmap(df[num_cols].corr(), annot=True, cmap='coolwarm')
plt.title('Correlation Matrix Heatmap');
## Model
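# classification report: table of metrics for classification
# Accuracy:     proportion of all predictions that are correct
# Macro avg:    unweighted mean of the per-class metric
# Weighted avg: mean of the per-class metric weighted by the number of instances in each class
# Precision:    proportion of true positives among all predicted positives
# Recall:       proportion of true positives among all actual positives
# F1-Score:     harmonic mean of precision and recall
# Support:      number of instances of each class in the dataset

# regression metrics
# MSE:          mean of the squared differences between actual and predicted values
# RMSE:         square root of MSE, in the same units as the target
# R2:           proportion of the target's variance explained by the model
# Adjusted R2:  R2 penalized for the number of predictors
# MAPE:         mean of the absolute errors expressed as a fraction of the actual values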
# import sklearn library (target classification)
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score, KFold
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.decomposition import PCA
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
import numpy as np

# Separate features and target
# (these column names come from a different example dataset; adjust them to your data)
X = df[['Income', 'CCAvg', 'CD_Account']]
y = df['Personal_Loan']

# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Identify categorical and numerical columns
categorical_cols = X_train.select_dtypes(include=['object', 'category']).columns
numerical_cols = X_train.select_dtypes(include=['number']).columns

# Preprocessing: OneHotEncoder for categorical features, StandardScaler for numerical features
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_cols),
        ('num', StandardScaler(), numerical_cols)
    ],
    remainder='passthrough',
    sparse_threshold=0  # always return a dense array, since PCA may not accept sparse input
)

# PCA keeps all components by default; pass n_components (e.g. PCA(n_components=2)) to reduce dimensionality
pca = PCA()

# Define the models and their parameter grids
models_and_params = {
    'RandomForest': {
        'model': RandomForestClassifier(random_state=42),
        'params': {
            'classifier__n_estimators': [50, 100, 200],
            'classifier__max_depth': [None, 10, 20, 30],
            'classifier__min_samples_split': [2, 5, 10],
            'classifier__min_samples_leaf': [1, 2, 4]
        }
    },
    'LogisticRegression': {
        'model': LogisticRegression(random_state=42, max_iter=1000),
        'params': {
            'classifier__C': [0.01, 0.1, 1, 10, 100],
            'classifier__solver': ['lbfgs', 'liblinear']
        }
    },
    'SVM': {
        'model': SVC(random_state=42),
        'params': {
            'classifier__C': [0.01, 0.1, 1, 10, 100],
            'classifier__kernel': ['linear', 'rbf']
        }
    }
}

# Function to evaluate models with cross-validation
def evaluate_model(model, X, y):
    kf = KFold(n_splits=5, shuffle=True, random_state=42)
    cv_scores = cross_val_score(model, X, y, cv=kf, scoring='accuracy')
    print("Cross-Validation Scores:", cv_scores)
    print("Mean Cross-Validation Score:", np.mean(cv_scores))

# Train and evaluate each model
for model_name, config in models_and_params.items():
    print(f"--- {model_name} ---")
    model = config['model']
    param_grid = config['params']
    
    pipeline = Pipeline(steps=[
        ('preprocessor', preprocessor),
        ('pca', pca),
        ('classifier', model)
    ])
    
    evaluate_model(pipeline, X_train, y_train)

    # Hyperparameter tuning using GridSearchCV
    grid_search = GridSearchCV(pipeline, param_grid, cv=5, n_jobs=-1, scoring='accuracy')
    grid_search.fit(X_train, y_train)
    
    print("Best Parameters:", grid_search.best_params_)
    print("Best Cross-Validation Score:", grid_search.best_score_)

    # Evaluate the best model on the test set
    best_model = grid_search.best_estimator_
    y_pred_best = best_model.predict(X_test)
    
    print("Best Model Classification Report:")
    print(classification_report(y_test, y_pred_best))
    print("Best Model Confusion Matrix:")
    print(confusion_matrix(y_test, y_pred_best))
    print("Best Model Accuracy Score:")
    print(accuracy_score(y_test, y_pred_best))
    print("\n")
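
The classification report entries can be traced back to the confusion matrix. A small sketch with made-up labels (the `y_true` and `y_pred` lists are hypothetical) computes precision, recall, F1, and accuracy by hand and compares them with scikit-learn's functions:

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Hypothetical binary labels: 4 positives, 6 negatives
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)          # correct among predicted positives
recall = tp / (tp + fn)             # found among actual positives
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"by hand: precision={precision:.2f} recall={recall:.2f} "
      f"f1={f1:.2f} accuracy={accuracy:.2f}")
print("sklearn:",
      precision_score(y_true, y_pred),
      recall_score(y_true, y_pred),
      f1_score(y_true, y_pred),
      accuracy_score(y_true, y_pred))
```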

# import sklearn library (target regression) 
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_percentage_error
import matplotlib.pyplot as plt
import seaborn as sns

# Load your dataset (ensure to load it as needed, here assuming df is your DataFrame)
# df = pd.read_csv('your_dataset.csv')

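# Separate the target variable 'P' from the features
# ('P' is the target column of a different example dataset; replace it with your own target)
X = df.drop(columns=['P'])
y = df['P']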

# Identify numerical and categorical columns
num_cols = X.select_dtypes(include=['number']).columns.tolist()
cat_cols = X.select_dtypes(include=['object', 'category']).columns.tolist()

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Preprocessing for numerical data
num_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

# Preprocessing for categorical data
cat_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Combine preprocessing steps
preprocessor = ColumnTransformer(
    transformers=[
        ('num', num_transformer, num_cols),
        ('cat', cat_transformer, cat_cols)
    ])

# Define the models
models = {
    'Linear Regression': LinearRegression(),
    'Random Forest': RandomForestRegressor(random_state=42),
    'Gradient Boosting': GradientBoostingRegressor(random_state=42)
}

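# Function to calculate Adjusted R-squared, which penalizes R2 for the number of predictors
# r2 = R-squared, n = number of samples, k = number of predictors
def adjusted_r2(r2, n, k):
    return 1 - ((1 - r2) * (n - 1) / (n - k - 1))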

# Train and evaluate each model
results = {}

for model_name, model in models.items():
    pipeline = Pipeline(steps=[
        ('preprocessor', preprocessor),
        ('model', model)
    ])
    
    # Train the model
    pipeline.fit(X_train, y_train)
    
    # Make predictions
    y_pred_train = pipeline.predict(X_train)
    y_pred_test = pipeline.predict(X_test)
    
    # Evaluate the model
    mse_train = mean_squared_error(y_train, y_pred_train)
    mse_test = mean_squared_error(y_test, y_pred_test)
    rmse_train = np.sqrt(mse_train)
    rmse_test = np.sqrt(mse_test)
    r2_train = r2_score(y_train, y_pred_train)
    r2_test = r2_score(y_test, y_pred_test)
    # k is the number of raw feature columns, before one-hot encoding expands them
    adj_r2_train = adjusted_r2(r2_train, X_train.shape[0], X_train.shape[1])
    adj_r2_test = adjusted_r2(r2_test, X_test.shape[0], X_test.shape[1])
    mape_train = mean_absolute_percentage_error(y_train, y_pred_train)
    mape_test = mean_absolute_percentage_error(y_test, y_pred_test)
    
    results[model_name] = {
        'Train MSE': mse_train,
        'Test MSE': mse_test,
        'Train RMSE': rmse_train,
        'Test RMSE': rmse_test,
        'Train R2': r2_train,
        'Test R2': r2_test,
        'Train Adjusted R2': adj_r2_train,
        'Test Adjusted R2': adj_r2_test,
        'Train MAPE': mape_train,
        'Test MAPE': mape_test,
    }

# Display the results
results_df = pd.DataFrame(results) # add '.T' to transpose
print(results_df)

# Visualize the feature importances for the Random Forest model (if applicable)
# Pipeline does not clone its steps, so fitting best_pipeline fits best_model in place
best_model = models['Random Forest']
best_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', best_model)
])

best_pipeline.fit(X_train, y_train)
importances = best_model.feature_importances_
features = num_cols + list(best_pipeline.named_steps['preprocessor'].transformers_[1][1].named_steps['onehot'].get_feature_names_out(cat_cols))
feature_importance_df = pd.DataFrame({'Feature': features, 'Importance': importances})
feature_importance_df = feature_importance_df.sort_values(by='Importance', ascending=False)

plt.figure(figsize=(10, 8))
sns.barplot(x='Importance', y='Feature', data=feature_importance_df)
plt.title('Feature Importances for Random Forest')
plt.show()
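
The regression metrics reported above can be checked by hand on a tiny example. The arrays and the predictor count `k = 1` below are made up for illustration; the adjusted R2 line uses the same formula as `adjusted_r2`.

```python
import numpy as np
from sklearn.metrics import (
    mean_absolute_percentage_error,
    mean_squared_error,
    r2_score,
)

# Hypothetical actual and predicted values
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 310.0, 380.0])

# MSE and RMSE
mse = np.mean((y_true - y_pred) ** 2)
rmse = np.sqrt(mse)

# R2 = 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

# Adjusted R2 with n samples and k predictors (k = 1 is an assumption here)
n, k = len(y_true), 1
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)

# MAPE as a fraction, matching scikit-learn's convention
mape = np.mean(np.abs((y_true - y_pred) / y_true))

print(f"MSE={mse:.1f} RMSE={rmse:.3f} R2={r2:.3f} AdjR2={adj_r2:.3f} MAPE={mape:.4f}")
print("sklearn:", mean_squared_error(y_true, y_pred), r2_score(y_true, y_pred),
      mean_absolute_percentage_error(y_true, y_pred))
```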






