
# Task
Perform logistic regression on the `heart.csv` dataset, evaluate the model's accuracy, precision, and recall, and identify the most important feature based on the model's coefficients.

## Load Dataset

### Subtask:
Load the provided 'heart.csv' dataset into a pandas DataFrame.


**Reasoning**:
First, I'll import the pandas library to enable data manipulation. Then, I'll load the 'heart.csv' dataset into a DataFrame, display its first 5 rows to verify the data, and show the columns and their data types to understand the dataset's structure.



In [2]:
import pandas as pd

# Load the dataset
df = pd.read_csv('/content/heart.csv')

# Display the first 5 rows
print("First 5 rows of the DataFrame:")
print(df.head())

# Display columns and their data types
print("\nDataFrame Info:")
df.info()

First 5 rows of the DataFrame:
   age  sex  cp  trestbps  chol  fbs  restecg  thalach  exang  oldpeak  slope  \
0   63    1   3       145   233    1        0      150      0      2.3      0   
1   37    1   2       130   250    0        1      187      0      3.5      0   
2   41    0   1       130   204    0        0      172      0      1.4      2   
3   56    1   1       120   236    0        1      178      0      0.8      2   
4   57    0   0       120   354    0        1      163      1      0.6      2   

   ca  thal  target  
0   0     1       1  
1   0     2       1  
2   0     2       1  
3   0     2       1  
4   0     2       1  

DataFrame Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       303 non-null    int64  
 1   sex       303 non-null    int64  
 2   cp        303 non-null    int64  
 3   trestbps  303 non-null    in
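A quick sanity check after loading can catch a wrong file or missing values before any modeling. The sketch below is illustrative only: it re-enters the five rows printed by `df.head()` as inline CSV so it runs without the file, and the names `sample_df`, `expected_columns`, `missing_columns`, and `missing_values` are hypothetical helpers, not part of the notebook.

```python
import io

import pandas as pd

# The five rows shown by df.head() above, typed in as inline CSV so that
# this sketch does not depend on the actual heart.csv file.
sample_csv = """age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
57,0,0,120,354,0,1,163,1,0.6,2,0,2,1
"""
sample_df = pd.read_csv(io.StringIO(sample_csv))

# Columns the rest of the analysis relies on (taken from the printed header).
expected_columns = [
    "age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
    "thalach", "exang", "oldpeak", "slope", "ca", "thal", "target",
]

# Report any expected column that is absent and the total count of NaNs.
missing_columns = [c for c in expected_columns if c not in sample_df.columns]
missing_values = int(sample_df.isna().sum().sum())

print(f"Shape: {sample_df.shape}")
print(f"Missing columns: {missing_columns}")
print(f"Missing values: {missing_values}")
```

The same two checks could be run on the full `df` in place of `sample_df`.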

## Prepare Data for Modeling

### Subtask:
Separate the features (X) and the target variable (y) from the dataset. Then, split the data into training and testing sets to prepare for logistic regression.


**Reasoning**:
To prepare the data for logistic regression, I will first separate the features (X) and the target variable (y), and then split them into training and testing sets using an 80-20 ratio and a fixed random state for reproducibility.



In [3]:
from sklearn.model_selection import train_test_split

# Separate features (X) and target variable (y)
X = df.drop('target', axis=1)
y = df['target']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"X_train shape: {X_train.shape}")
print(f"X_test shape: {X_test.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"y_test shape: {y_test.shape}")

X_train shape: (242, 13)
X_test shape: (61, 13)
y_train shape: (242,)
y_test shape: (61,)
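With only 61 test rows, a plain random split can leave the two classes in slightly different proportions in the training and test sets. One common option, not used in the cell above, is to pass `stratify=y` to `train_test_split`. The sketch below uses made-up stand-in data (`X_demo`, `y_demo`, with an arbitrary class mix) purely to show the effect; it does not read `heart.csv`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the real data: 303 rows, 13 features,
# binary labels with an arbitrary (assumed) class mix of 160 vs 143.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(303, 13))
y_demo = np.array([1] * 160 + [0] * 143)

# stratify=y_demo keeps the class proportions nearly equal in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(f"Train shape: {X_tr.shape}, test shape: {X_te.shape}")
print(f"Positive rate overall: {y_demo.mean():.3f}")
print(f"Positive rate in train: {y_tr.mean():.3f}")
print(f"Positive rate in test: {y_te.mean():.3f}")
```

In the notebook itself this would amount to adding `stratify=y` to the existing call; the resulting metrics would likely differ somewhat from those reported here.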


## Apply Logistic Regression

### Subtask:
Train a logistic regression model on the prepared training data.


**Reasoning**:
To train the logistic regression model, I will import the `LogisticRegression` class, instantiate it with `random_state=42`, and then fit the model to the training data (`X_train`, `y_train`).



In [4]:
from sklearn.linear_model import LogisticRegression

# Instantiate the Logistic Regression model
model = LogisticRegression(random_state=42)

# Fit the model to the training data
model.fit(X_train, y_train)

print("Logistic Regression model trained successfully.")

Logistic Regression model trained successfully.


STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


**Reasoning**:
The previous logistic regression model issued a `ConvergenceWarning` because the optimizer reached its iteration limit. To address this, I will scale the features using `StandardScaler` to help with convergence and also increase the `max_iter` parameter in the `LogisticRegression` model.



In [5]:
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Initialize StandardScaler
scaler = StandardScaler()

# Scale the training and testing data
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Instantiate the Logistic Regression model with increased max_iter
# The default solver 'lbfgs' is generally good, and max_iter helps with convergence issues.
model = LogisticRegression(random_state=42, max_iter=1000)

# Fit the model to the scaled training data
model.fit(X_train_scaled, y_train)

print("Logistic Regression model trained successfully on scaled data.")

Logistic Regression model trained successfully on scaled data.
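The scale-then-fit steps above can also be bundled into a single scikit-learn `Pipeline`, which makes it harder to accidentally fit the scaler on test data or to forget scaling at prediction time. This is a minimal sketch of that alternative, assuming synthetic data from `make_classification` in place of `heart.csv`; `clf`, `X_demo`, and `y_demo` are illustrative names.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with the same shape as the heart data (assumption).
X_demo, y_demo = make_classification(
    n_samples=303, n_features=13, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# The pipeline fits the scaler on the training data only, then applies
# the same transformation automatically inside predict() and score().
clf = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=1000, random_state=42),
)
clf.fit(X_tr, y_tr)

test_accuracy = clf.score(X_te, y_te)
print(f"Test accuracy on synthetic data: {test_accuracy:.4f}")
```

The fitted model is still reachable as `clf.named_steps["logisticregression"]`, so its `coef_` attribute can be inspected in the same way as for a standalone model.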


## Evaluate Model Performance

### Subtask:
Calculate and display the accuracy, precision, and recall of the logistic regression model on the test data.


**Reasoning**:
To evaluate the model's performance, I will first make predictions on the scaled test data using the trained model. Then, I will import the necessary metrics from `sklearn.metrics` and calculate the accuracy, precision, and recall scores, finally printing them with clear labels.



In [6]:
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Make predictions on the scaled test data
y_pred = model.predict(X_test_scaled)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)

# Calculate precision
precision = precision_score(y_test, y_pred)

# Calculate recall
recall = recall_score(y_test, y_pred)

# Print the results
print(f"Model Accuracy: {accuracy:.4f}")
print(f"Model Precision: {precision:.4f}")
print(f"Model Recall: {recall:.4f}")

Model Accuracy: 0.8525
Model Precision: 0.8710
Model Recall: 0.8438
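To see how the three scores relate, it helps to look at the underlying confusion-matrix counts. The notebook does not print them, so the counts below (TP=27, FP=4, FN=5, TN=25) are an inference: they are one set of counts over 61 test samples that reproduces the rounded scores above, not values taken from the notebook output. The arrays `y_true_demo` and `y_pred_demo` are constructed solely to match those assumed counts.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)

# Assumed counts, inferred from the rounded scores (not notebook output).
tp_n, fp_n, fn_n, tn_n = 27, 4, 5, 25

# Build label arrays that realize exactly those counts.
y_true_demo = [1] * tp_n + [0] * fp_n + [1] * fn_n + [0] * tn_n
y_pred_demo = [1] * tp_n + [1] * fp_n + [0] * fn_n + [0] * tn_n

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

acc = accuracy_score(y_true_demo, y_pred_demo)    # (tp + tn) / total
prec = precision_score(y_true_demo, y_pred_demo)  # tp / (tp + fp)
rec = recall_score(y_true_demo, y_pred_demo)      # tp / (tp + fn)

print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
print(f"Accuracy:  {acc:.4f}")
print(f"Precision: {prec:.4f}")
print(f"Recall:    {rec:.4f}")
```

Under these assumed counts, accuracy is 52/61, precision is 27/31, and recall is 27/32. Running `confusion_matrix(y_test, y_pred)` on the real predictions would confirm or correct them.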


## Identify Most Important Column

### Subtask:
Determine and display the most important feature (column) based on the coefficients of the trained logistic regression model.


**Reasoning**:
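To identify the most important feature, I will access the logistic regression model's coefficients, map them to the original feature names, and find the feature with the largest absolute coefficient. Because the model was fit on standardized features, the coefficient magnitudes are on a comparable scale, which is what makes this ranking meaningful. I will then display the most important feature and its coefficient.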



In [7]:
# Get the coefficients from the trained model
coefficients = model.coef_[0]

# Get the feature names from the original X DataFrame
feature_names = X.columns

# Create a Series to easily map feature names to their coefficients
coef_series = pd.Series(coefficients, index=feature_names)

# Find the feature with the largest absolute coefficient value
most_important_feature = coef_series.abs().idxmax()
most_important_coefficient = coef_series[most_important_feature]

print(f"The most important feature is: '{most_important_feature}'")
print(f"Its coefficient is: {most_important_coefficient:.4f}")

The most important feature is: 'cp'
Its coefficient is: 0.8179
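Looking only at the top feature hides how close the runners-up are. A possible extension is to rank all coefficients by absolute value and convert each one to an odds ratio with `exp`. The sketch below is illustrative: it uses synthetic data and hypothetical feature names (`feature_0` to `feature_4`), and `model_demo` and `coef_table` are not objects from the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with hypothetical feature names (assumption).
X_arr, y_demo = make_classification(
    n_samples=303, n_features=5, n_informative=3, n_redundant=0,
    random_state=42,
)
feature_names = [f"feature_{i}" for i in range(5)]
X_demo = pd.DataFrame(X_arr, columns=feature_names)

# Standardize so that coefficient magnitudes are comparable across features.
X_scaled = StandardScaler().fit_transform(X_demo)
model_demo = LogisticRegression(max_iter=1000, random_state=42)
model_demo.fit(X_scaled, y_demo)

# One row per feature: raw coefficient and the corresponding odds ratio,
# i.e. the multiplicative change in odds per one-standard-deviation increase.
coef_table = pd.DataFrame(
    {
        "coefficient": model_demo.coef_[0],
        "odds_ratio": np.exp(model_demo.coef_[0]),
    },
    index=feature_names,
)

# Sort by absolute coefficient, largest first.
coef_table = coef_table.sort_values(
    "coefficient", key=lambda s: s.abs(), ascending=False
)
print(coef_table)
```

Applied to the reported coefficient, exp(0.8179) is about 2.27, so under this model a one-standard-deviation increase in `cp` multiplies the odds of `target = 1` by roughly that factor, holding the other features fixed.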


## Final Task

### Subtask:
Provide a summary of the logistic regression model's performance metrics and the identified most important feature from the dataset.


## Summary:

### Q&A
*   **What are the logistic regression model's performance metrics?**
    The logistic regression model achieved an accuracy of 0.8525, a precision of 0.8710, and a recall of 0.8438 on the test data.
*   **What is the identified most important feature from the dataset based on the model?**
    The most important feature identified is 'cp' (chest pain type), with a coefficient of 0.8179.

### Data Analysis Key Findings
*   The `heart.csv` dataset, containing 303 entries and 14 columns, was successfully loaded and processed.
*   The data was split into training (242 samples) and testing (61 samples) sets, with 13 features used for prediction.
*   A first fit on unscaled data stopped at the solver's iteration limit. The model was then retrained on features standardized with `StandardScaler`, with `max_iter` raised to 1000, and no convergence warning was reported.
*   On the 61-sample test set, the model reached an accuracy of 85.25%, a precision of 87.10%, and a recall of 84.38%.
*   The feature 'cp' (chest pain type) had the largest absolute coefficient (0.8179) among the standardized features, which makes it the strongest predictor of the target variable by that measure.

### Insights or Next Steps
*   Further investigation into the 'cp' feature is warranted to understand its exact relationship with heart disease, given that it carries the largest coefficient in the model.
*   Consider exploring other advanced classification models or hyperparameter tuning for the current logistic regression model to potentially enhance performance metrics.
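As one way to approach the tuning step suggested above, the regularization strength `C` of logistic regression can be searched with cross-validation. The following is a minimal sketch under stated assumptions: it uses synthetic data rather than `heart.csv`, and the grid values and the names `pipe`, `param_grid`, and `search` are arbitrary illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with the same shape as the heart data (assumption).
X_demo, y_demo = make_classification(
    n_samples=303, n_features=13, random_state=42
)

# Scaling lives inside the pipeline so that each CV fold is scaled using
# only its own training portion.
pipe = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=1000, random_state=42),
)

# Candidate values for C (smaller C means stronger regularization).
# The step name "logisticregression" is generated by make_pipeline.
param_grid = {"logisticregression__C": [0.01, 0.1, 1.0, 10.0]}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_demo, y_demo)

print(f"Best C: {search.best_params_['logisticregression__C']}")
print(f"Mean CV accuracy for best C: {search.best_score_:.4f}")
```

Because the test set here is small, a cross-validated score of this kind may give a steadier estimate than a single 61-sample split. Other metrics such as `"precision"` or `"recall"` can be passed to `scoring` instead.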
