# Selecting the Best model Algorithm with the high accuray using Pipeline

- Here we are applying different models on dataset and and compairing them which model is best fit and have the best accuracy on a dataset
- We are using different metrics to evaluate the model such as accuracy, precision, recall, f1
- We are using cross-validation to evaluate the model on unseen data and to prevent overfitting
- We are using grid search to find the best parameters for the model
- We are using different models such as logistic regression, decision tree, random forest, support vector machine
and linear regression model on a dataset
- A pipeline method is used to compair all these methods


### Using the Machine Learning library Scikit Learn as sklearn for the data compairing pupose 

***Import the libraries***

In [2]:
#import the libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

***Import the seaborn libraries***

In [3]:
#Import the sklearn libraries:
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.metrics import accuracy_score, recall_score, f1_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer


***Import the dataset***

In [4]:
#Import the dataset of titanic:
df = sns.load_dataset('titanic')

In [5]:
#Check the dataset:
df.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


In [6]:
#Split the data into the X & y:
X = df[['pclass', 'sex', 'age', 'fare', 'embarked']]
y = df['survived']

In [7]:
#Split the data into train_test_split:
X_test, X_train, y_test, y_train = train_test_split(X, y , test_size=0.2, random_state=42)

In [8]:
#Model list:
models = [
    ('Random Forest', RandomForestClassifier(random_state=42)),
    ('Gradient Boosting', GradientBoostingClassifier(random_state=42)),
    ('XGBooster', XGBClassifier()),
    ('Decision Tree', DecisionTreeClassifier(random_state=42)),
    ('Support Vector Machine', SVC())
]

best_model = None
best_accuracy = 0.0

for name, model in models:
    pipeline = Pipeline([
        ('imputer', SimpleImputer(strategy='most_frequent')),
        ('Encoder', OneHotEncoder(handle_unknown='ignore')),
        ('model', model)
    ])

    # Perform the cross-validation
    scores = cross_val_score(pipeline, X_train, y_train, cv=5)

    # Perform the mean score:
    mean_accuracy = scores.mean()

    # Fit the pipeline:
    pipeline.fit(X_train, y_train)

    # Make a prediction
    y_pred = pipeline.predict(X_test)

    # Calculate the Accuracy:
    accuracy = accuracy_score(y_test, y_pred)

    # Perform Metrics:
    print(name)
    print(f'Mean Accuracy: {mean_accuracy:.2f}')
    print(f'Test Accuracy: {accuracy:.2f}')
    print()

    # Best model accuracy:
    if accuracy > best_accuracy:
        best_accuracy = accuracy
        best_model = pipeline
        print(f'Best Model: {best_model}')

Random Forest
Mean Accuracy: 0.78
Test Accuracy: 0.82

Best Model: Pipeline(steps=[('imputer', SimpleImputer(strategy='most_frequent')),
                ('Encoder', OneHotEncoder(handle_unknown='ignore')),
                ('model', RandomForestClassifier(random_state=42))])
Gradient Boosting
Mean Accuracy: 0.78
Test Accuracy: 0.79

XGBooster
Mean Accuracy: 0.71
Test Accuracy: 0.79

Decision Tree
Mean Accuracy: 0.78
Test Accuracy: 0.78

Support Vector Machine
Mean Accuracy: 0.78
Test Accuracy: 0.79

