# Machine Learning in scikit-learn: Day 20



Written by: M.Danish Azeem\
Date: 01.20.2024\
Email: danishazeem365@gmail.com

# Best model selection

In [2]:

# import libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt


df = sns.load_dataset("titanic")
X = df[['pclass', 'sex', 'age', 'sibsp', 'parch', 'fare']]
y = df['survived']
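X = pd.get_dummies(X, columns=['sex'])  # one-hot encode 'sex' into sex_female / sex_male
X['age'] = X['age'].fillna(X['age'].mean())  # fill missing ages with the mean age (avoids chained inplace fillna)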


from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

models = [LogisticRegression(), SVC(), DecisionTreeClassifier(), RandomForestClassifier(), KNeighborsClassifier()]
model_names = ['Logistic Regression', 'SVM', 'Decision Tree', 'Random Forest', 'KNN']

models_scores = []
for model, model_name in zip(models, model_names):
    model.fit(X_train, y_train)  # train on the training split
    y_pred = model.predict(X_test)  # predict on the held-out test split
    accuracy = accuracy_score(y_test, y_pred)
    models_scores.append([model_name, accuracy])

sorted_models = sorted(models_scores, key=lambda x: x[1], reverse=True)
for model in sorted_models:
    print("Accuracy Score: ",f'{model[0]} {model[1]:.2f}')


Accuracy Score:  Random Forest 0.82
Accuracy Score:  Logistic Regression 0.81
Accuracy Score:  Decision Tree 0.75
Accuracy Score:  KNN 0.69
Accuracy Score:  SVM 0.66


# Note
## To reuse this code on any other classification dataset, change only these lines, then check your accuracy score.
df = sns.load_dataset("titanic")\
X = df[['pclass', 'sex', 'age', 'sibsp', 'parch', 'fare']]\
y = df['survived']


## To score with precision, recall, or F1 instead of accuracy, change only the two score lines inside the loop (and the label in `print`); the rest of the code stays the same.

    accuracy = accuracy_score(y_test, y_pred)
    models_scores.append([model_name, accuracy])

    Precision = precision_score(y_test, y_pred)
    models_scores.append([model_name, Precision])

    Recall = recall_score(y_test, y_pred)
    models_scores.append([model_name, Recall])

    F1 = f1_score(y_test, y_pred)
    models_scores.append([model_name, F1])


In [3]:
models = [LogisticRegression(), SVC(), DecisionTreeClassifier(), RandomForestClassifier(), KNeighborsClassifier()]
model_names = ['Logistic Regression', 'SVM', 'Decision Tree', 'Random Forest', 'KNN']
models_scores = []
for model, model_name in zip(models, model_names):
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    Precision = precision_score(y_test, y_pred)
    models_scores.append([model_name,Precision])

sorted_models = sorted(models_scores, key=lambda x: x[1], reverse=True)
for model in sorted_models:
    print("Precision Score: ", f'{model[0]} : {model[1]:.2f}')

Precision Score:  Logistic Regression : 0.80
Precision Score:  Random Forest : 0.78
Precision Score:  SVM : 0.76
Precision Score:  Decision Tree : 0.71
Precision Score:  KNN : 0.66


In [4]:
models = [LogisticRegression(), SVC(), DecisionTreeClassifier(), RandomForestClassifier(), KNeighborsClassifier()]
model_names = ['Logistic Regression', 'SVM', 'Decision Tree', 'Random Forest', 'KNN']
models_scores = []
for model, model_name in zip(models, model_names):
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    Recall = recall_score(y_test, y_pred)
    models_scores.append([model_name,Recall])

sorted_models = sorted(models_scores, key=lambda x: x[1], reverse=True)
for model in sorted_models:
    print("Recall Score: ",f'{model[0]} : {model[1]:.2f}')

Recall Score:  Logistic Regression : 0.72
Recall Score:  Decision Tree : 0.70
Recall Score:  Random Forest : 0.70
Recall Score:  KNN : 0.54
Recall Score:  SVM : 0.26


In [5]:
models = [LogisticRegression(), SVC(), DecisionTreeClassifier(), RandomForestClassifier(), KNeighborsClassifier()]
model_names = ['Logistic Regression', 'SVM', 'Decision Tree', 'Random Forest', 'KNN']
models_scores = []
for model, model_name in zip(models, model_names):
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    F1 = f1_score(y_test, y_pred)
    models_scores.append([model_name,F1])

sorted_models = sorted(models_scores, key=lambda x: x[1], reverse=True)
for model in sorted_models:
    print("F1 Score: ",f'{model[0]} : {model[1]:.2f}')


F1 Score:  Random Forest : 0.77
F1 Score:  Logistic Regression : 0.76
F1 Score:  Decision Tree : 0.70
F1 Score:  KNN : 0.59
F1 Score:  SVM : 0.38
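
The four cells above refit every model once per metric. As a hedged alternative, the same comparison can be done in a single pass that fits each model once and records all four scores. This is only a sketch: it runs on synthetic data from `make_classification` (an assumption, so it works without downloading Titanic), and the names `demo_models`, `metric_fns`, `all_scores`, and `ranked` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption); replace with the Titanic X and y from above.
X_demo, y_demo = make_classification(n_samples=500, n_features=6, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

demo_models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Decision Tree': DecisionTreeClassifier(random_state=42),
    'KNN': KNeighborsClassifier(),
}
metric_fns = {
    'accuracy': accuracy_score,
    'precision': precision_score,
    'recall': recall_score,
    'f1': f1_score,
}

all_scores = []
for name, model in demo_models.items():
    model.fit(X_tr, y_tr)          # fit once per model
    y_pred = model.predict(X_te)   # reuse the same predictions for every metric
    scores = {metric: fn(y_te, y_pred) for metric, fn in metric_fns.items()}
    all_scores.append([name, scores])

# Rank by F1 here; any of the four metric names would work as the sort key.
ranked = sorted(all_scores, key=lambda x: x[1]['f1'], reverse=True)
for name, scores in ranked:
    print(name, '|', ' | '.join(f'{m}: {v:.2f}' for m, v in scores.items()))
```

Sorting by a single metric is a design choice; which metric matters most depends on the cost of false positives versus false negatives in your problem.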


# Assignment # 1

# Time stamp 07:51

# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) : what does random_state=42 mean? Explain it.



**Here's a breakdown of the meaning of `random_state=42` in the `train_test_split` function:**

**Purpose:**

- It controls the randomness of the data splitting process.
- It ensures that the same random split is generated each time you run the code, making results reproducible.

**How It Works:**

1. **Internal Random Number Generator:** The function employs a random number generator (RNG) to shuffle the data before splitting it.
2. **Seeding the RNG:** The `random_state` argument acts as a seed for this RNG.
3. **Same Seed, Same Sequence:** Setting `random_state` to a specific value guarantees that the RNG produces the same sequence of random numbers each time, leading to consistent splits.

**Key Points:**

- **Reproducibility:** Using `random_state` is crucial for ensuring reproducibility of results, especially when comparing model performance across multiple runs.
- **Arbitrary Number:** The specific value of `42` is a common convention, but any integer can be used as the seed.
- **Default Behavior:** If `random_state` is not set (the default is `None`), the function draws from NumPy's global random state, so the split generally differs from run to run unless that global state has itself been seeded.

**Example:**

- If you run `train_test_split` twice with `random_state=42`, you'll get the same exact training and testing sets both times.
- Without setting `random_state`, you'll likely get different splits each time, even with the same data and parameters.

**Importance in Machine Learning:**

- **Comparative Analysis:** Reproducibility is essential for comparing different models or hyperparameter settings fairly.
- **Collaboration and Sharing:** It facilitates collaboration and result sharing by enabling others to replicate your results using the same data splits.
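
The reproducibility claim can be checked directly. The sketch below is a minimal illustration on a tiny made-up array (the names `X_demo`, `y_demo`, and `same_seed_match` are hypothetical, not part of the notebook): it splits the same data twice with `random_state=42` and confirms that both calls return the same test rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Tiny illustrative dataset (assumption): 10 rows, 2 features, labels 0..9.
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# Two calls with the same seed, and one call with a different seed.
_, _, _, y_test_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
_, _, _, y_test_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
_, _, _, y_test_c = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

same_seed_match = np.array_equal(y_test_a, y_test_b)
print('Same seed gives same test rows:', same_seed_match)
print('Seed 42 test labels:', y_test_a)
print('Seed 0 test labels: ', y_test_c)  # a different seed will usually pick different rows
```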


# Assignment # 2

# Time stamp  10:54

# sorted_models = sorted(models_scores, key=lambda x: x[1], reverse=True) : what is a lambda function?



In Python, a lambda function is a small, anonymous function defined using the `lambda` keyword. It allows you to create a function without formally using the `def` keyword. The general syntax of a lambda function is:

```python
lambda arguments: expression
```

Here, `arguments` are the input parameters, and `expression` is a single expression that is evaluated and returned. Lambda functions are often used for short, simple operations where a full function definition is not necessary.

In the context of your code snippet:

```python
sorted_models = sorted(models_scores, key=lambda x: x[1], reverse=True)
```

Here, `lambda x: x[1]` defines a lambda function that takes an argument `x` and returns the element at index 1 of `x`. This lambda function is then used as the `key` argument in the `sorted` function. The `sorted` function will sort the elements of `models_scores` based on the values returned by the lambda function, which is the second element (the score) of each `[model_name, score]` list in `models_scores`.

In simple terms, this line of code sorts the list `models_scores` by the score in each `[model_name, score]` pair, in descending order.
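
To make the equivalence concrete, here is a small sketch comparing the lambda with a named `def` function and with `operator.itemgetter` from the standard library. The three scores are copied from the accuracy output earlier in this notebook; the names `scores`, `second_item`, `by_lambda`, `by_def`, and `by_itemgetter` are hypothetical.

```python
from operator import itemgetter

# [model_name, accuracy] pairs taken from the accuracy results above.
scores = [['Logistic Regression', 0.81], ['SVM', 0.66], ['Random Forest', 0.82]]

def second_item(pair):
    """Named equivalent of `lambda x: x[1]`."""
    return pair[1]

by_lambda = sorted(scores, key=lambda x: x[1], reverse=True)
by_def = sorted(scores, key=second_item, reverse=True)
by_itemgetter = sorted(scores, key=itemgetter(1), reverse=True)

print(by_lambda)
# [['Random Forest', 0.82], ['Logistic Regression', 0.81], ['SVM', 0.66]]
print(by_lambda == by_def == by_itemgetter)
# True
```

All three keys produce the same ordering; the lambda is simply the shortest way to write a one-off key function inline.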

# Assignment # 3

# Time stamp  22:20

# Use a dataset from Kaggle: do EDA, data wrangling, and ML, then select the best classification model.

## Create an account on Kaggle



# Assignment # 4

# Time stamp  

# Explain the best-model code, or add comments to it
This code performs the following steps:

1. Imports necessary libraries: `pandas`, `numpy`, `seaborn`, and `matplotlib.pyplot`.
2. Loads the Titanic dataset using seaborn's `load_dataset` function and extracts features `X` and target variable `y`.
3. Selects specific columns from the features (`'pclass'`, `'sex'`, `'age'`, `'sibsp'`, `'parch'`, `'fare'`) and one-hot encodes the `'sex'` column using `pd.get_dummies`.
4. Fills missing values in the `'age'` column with the mean age.
5. Imports machine learning models (`LogisticRegression`, `SVC`, `DecisionTreeClassifier`, `RandomForestClassifier`, `KNeighborsClassifier`) and evaluation metrics (`accuracy_score`, `f1_score`, `precision_score`, `recall_score`) from scikit-learn.
6. Splits the data into training and testing sets using `train_test_split`.
7. Creates a list of models and their corresponding names.
8. Iterates over each model, fits it to the training data, predicts on the test data, and calculates accuracy.
9. Stores the model names and their corresponding accuracy scores in the `models_scores` list.
10. Sorts the models based on accuracy in descending order.
11. Prints the accuracy scores for each model.

The code uses a common machine learning workflow to compare the performance of different classification models on the Titanic dataset. It focuses on the accuracy metric and prints the results in descending order. Note that depending on the dataset and problem, you may want to consider other metrics and perform more in-depth analysis, such as hyperparameter tuning or cross-validation, for a comprehensive evaluation of model performance.
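
As one way to follow up on the cross-validation suggestion, the sketch below scores two of the models with 5-fold cross-validation instead of a single train/test split. It is an illustration under stated assumptions, not the notebook's method: the data is synthetic (`make_classification`), and `candidates`, `cv_results`, and the chosen settings (`cv=5`, `n_estimators=50`, `max_iter=1000`) are hypothetical choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (assumption); replace with the Titanic X and y.
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=42)

candidates = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Random Forest': RandomForestClassifier(n_estimators=50, random_state=42),
}

cv_results = []
for name, model in candidates.items():
    # One accuracy value per fold; the mean is less sensitive to a lucky or unlucky split.
    fold_scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring='accuracy')
    cv_results.append([name, fold_scores.mean(), fold_scores.std()])

for name, mean_acc, std_acc in sorted(cv_results, key=lambda x: x[1], reverse=True):
    print(f'{name}: {mean_acc:.2f} +/- {std_acc:.2f}')
```

The spread across folds gives a rough sense of whether a small gap between two models, such as 0.82 versus 0.81, is meaningful.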

