# Task
Load the data from "data.csv", train a logistic regression model, and evaluate its performance.

## Load the data

### Subtask:
Load the data from the provided CSV file into a pandas DataFrame.


**Reasoning**:
Load the data from the CSV file into a pandas DataFrame and display the head and info.



**Reasoning**:
The previous attempt failed because the file 'data.csv' was not found. The available file is 'framingham.csv'. Load the correct file and display the head and info.



In [3]:
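import pandas as pd

# Load the Framingham dataset from the Colab workspace and inspect it
df = pd.read_csv('/content/framingham.csv')
display(df.head())
display(df.info())  # info() prints its report and returns None, so "None" follows the report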

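|  | male | age | education | currentSmoker | cigsPerDay | BPMeds | prevalentStroke | prevalentHyp | diabetes | totChol | sysBP | diaBP | BMI | heartRate | glucose | TenYearCHD |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 39 | 4.0 | 0 | 0.0 | 0.0 | 0 | 0 | 0 | 195.0 | 106.0 | 70.0 | 26.97 | 80.0 | 77.0 | 0 |
| 1 | 0 | 46 | 2.0 | 0 | 0.0 | 0.0 | 0 | 0 | 0 | 250.0 | 121.0 | 81.0 | 28.73 | 95.0 | 76.0 | 0 |
| 2 | 1 | 48 | 1.0 | 1 | 20.0 | 0.0 | 0 | 0 | 0 | 245.0 | 127.5 | 80.0 | 25.34 | 75.0 | 70.0 | 0 |
| 3 | 0 | 61 | 3.0 | 1 | 30.0 | 0.0 | 0 | 1 | 0 | 225.0 | 150.0 | 95.0 | 28.58 | 65.0 | 103.0 | 1 |
| 4 | 0 | 46 | 3.0 | 1 | 23.0 | 0.0 | 0 | 0 | 0 | 285.0 | 130.0 | 84.0 | 23.1 | 85.0 | 85.0 | 0 |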


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4238 entries, 0 to 4237
Data columns (total 16 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   male             4238 non-null   int64  
 1   age              4238 non-null   int64  
 2   education        4133 non-null   float64
 3   currentSmoker    4238 non-null   int64  
 4   cigsPerDay       4209 non-null   float64
 5   BPMeds           4185 non-null   float64
 6   prevalentStroke  4238 non-null   int64  
 7   prevalentHyp     4238 non-null   int64  
 8   diabetes         4238 non-null   int64  
 9   totChol          4188 non-null   float64
 10  sysBP            4238 non-null   float64
 11  diaBP            4238 non-null   float64
 12  BMI              4219 non-null   float64
 13  heartRate        4237 non-null   float64
 14  glucose          3850 non-null   float64
 15  TenYearCHD       4238 non-null   int64  
dtypes: float64(9), int64(7)
memory usage: 529.9 KB


None
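The `info()` report above already hints at where values are missing (for example, `glucose` has 3850 non-null values out of 4238 rows, so 388 are missing). A minimal sketch of a per-column summary follows; the helper name `missing_summary` and the small `demo` frame are hypothetical stand-ins, and with the notebook's data `df` would be passed instead.

```python
import numpy as np
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return missing-value counts and percentages per column, largest first."""
    counts = frame.isna().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(2),
    })
    # Keep only the columns that actually have gaps
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Hypothetical stand-in frame; replace `demo` with `df` in the notebook
demo = pd.DataFrame({
    "glucose": [77.0, np.nan, 70.0, np.nan],
    "education": [4.0, 2.0, np.nan, 3.0],
    "age": [39, 46, 48, 61],
})
print(missing_summary(demo))
```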

## Prepare the data

### Subtask:
Clean and preprocess the data for logistic regression. This may involve handling missing values, encoding categorical variables, and splitting the data into training and testing sets.


**Reasoning**:
Handle missing values by imputing the mean for numerical columns, separate features and target, and split the data into training and testing sets.



In [4]:
from sklearn.model_selection import train_test_split
import numpy as np

# Impute missing numerical values with the column mean.
# Note: the means are computed on the full dataset, before the train/test split.
for col in df.columns:
    if df[col].dtype in [np.float64, np.int64]:
        df[col] = df[col].fillna(df[col].mean())

# Separate features and target
X = df.drop('TenYearCHD', axis=1)
y = df['TenYearCHD']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

display(X_train.head())
display(y_train.head())

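|  | male | age | education | currentSmoker | cigsPerDay | BPMeds | prevalentStroke | prevalentHyp | diabetes | totChol | sysBP | diaBP | BMI | heartRate | glucose |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3252 | 1 | 40 | 4.0 | 1 | 30.0 | 0.0 | 0 | 0 | 0 | 205.0 | 131.0 | 81.0 | 23.74 | 66.0 | 87.0 |
| 3946 | 0 | 57 | 2.0 | 0 | 0.0 | 0.0 | 0 | 1 | 0 | 250.0 | 152.5 | 92.5 | 32.31 | 75.0 | 94.0 |
| 1261 | 0 | 47 | 1.0 | 0 | 0.0 | 0.0 | 0 | 0 | 0 | 230.0 | 123.0 | 71.0 | 26.98 | 83.0 | 73.0 |
| 2536 | 1 | 41 | 2.0 | 1 | 30.0 | 0.0 | 0 | 0 | 0 | 228.0 | 113.0 | 82.5 | 25.67 | 67.0 | 70.0 |
| 4089 | 0 | 64 | 1.0 | 0 | 0.0 | 0.0 | 0 | 1 | 0 | 232.0 | 149.5 | 84.0 | 20.49 | 68.0 | 96.0 |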


|  | TenYearCHD |
|---|---|
| 3252 | 0 |
| 3946 | 0 |
| 1261 | 0 |
| 2536 | 0 |
| 4089 | 0 |
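The cell above fills missing values using means computed over the whole dataset, so the test rows influence the fill values. One possible alternative, sketched here under assumptions, is to split first and fit the imputer on the training rows only. This is not what the notebook did: the helper `split_then_impute`, the `stratify` option, and the synthetic `demo` frame are all additions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split


def split_then_impute(frame, target, test_size=0.2, seed=42):
    """Split first, then mean-impute using statistics from the training rows only."""
    X = frame.drop(columns=[target])
    y = frame[target]
    # stratify=y keeps the class ratio similar in both splits (an added assumption)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=seed, stratify=y
    )
    imputer = SimpleImputer(strategy="mean")
    X_train = pd.DataFrame(
        imputer.fit_transform(X_train), columns=X.columns, index=X_train.index
    )
    # The test rows are filled with the training means, never their own
    X_test = pd.DataFrame(
        imputer.transform(X_test), columns=X.columns, index=X_test.index
    )
    return X_train, X_test, y_train, y_test


# Synthetic stand-in data (hypothetical values); use `df` in the notebook
rng = np.random.default_rng(0)
glucose = rng.normal(80, 10, size=100)
glucose[::7] = np.nan  # knock out some values to impute
demo = pd.DataFrame({
    "age": rng.integers(30, 70, size=100).astype(float),
    "glucose": glucose,
    "TenYearCHD": np.array([1] * 15 + [0] * 85),
})
X_tr, X_te, y_tr, y_te = split_then_impute(demo, "TenYearCHD")
print(X_tr.isna().sum().sum(), X_te.isna().sum().sum())
```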


## Build the logistic regression model

### Subtask:
Train a logistic regression model on the training data.


**Reasoning**:
Fit scikit-learn's `LogisticRegression` with its default settings on the training data.



In [5]:
from sklearn.linear_model import LogisticRegression

# Instantiate the model
model = LogisticRegression()

# Fit the model to the training data
model.fit(X_train, y_train)

STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
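The warning above names two remedies: scaling the data and raising `max_iter`. A minimal sketch that applies both in a scikit-learn pipeline is shown below. It runs on synthetic data from `make_classification` rather than on the notebook's features, and the helper name `build_scaled_model` and the value `max_iter=1000` are assumptions, not settings taken from the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def build_scaled_model(max_iter=1000):
    """Standardize features, then fit logistic regression with a higher iteration cap."""
    return make_pipeline(StandardScaler(), LogisticRegression(max_iter=max_iter))


# Synthetic stand-in data; in the notebook, fit on X_train and y_train instead
X_demo, y_demo = make_classification(
    n_samples=300, n_features=15, weights=[0.85], random_state=42
)
X_demo[:, 0] *= 1000  # mimic a feature on a much larger scale than the others

scaled_model = build_scaled_model().fit(X_demo, y_demo)
# n_iter_ reports how many solver iterations were actually used
print(scaled_model.named_steps["logisticregression"].n_iter_)
```

Putting the scaler inside the pipeline means it is fitted only on the data passed to `fit`, so the same object can later be applied to the test set without refitting.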


## Evaluate the model

### Subtask:
Evaluate the performance of the trained model using appropriate metrics on the testing data.


**Reasoning**:
Predict on the test set and report accuracy, precision, recall, and F1 score.



In [6]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Make predictions on the testing data
y_pred = model.predict(X_test)

# Calculate evaluation metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

# Print the metrics
print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1 Score: {f1:.4f}")

Accuracy: 0.8561
Precision: 0.6250
Recall: 0.0403
F1 Score: 0.0758
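A confusion matrix makes the gap between accuracy and recall easier to read. The sketch below uses illustrative labels only: the counts (5 true positives, 3 false positives, 119 false negatives, 721 true negatives) were picked because they reproduce the four reported metrics on an 848-row test set, and they were not read from the notebook's actual predictions. The helper name `summarize_confusion` is hypothetical; in the notebook, `y_test` and `y_pred` would be passed instead.

```python
import numpy as np
from sklearn.metrics import confusion_matrix


def summarize_confusion(y_true, y_pred):
    """Return confusion-matrix counts, basic metrics, and the majority-class baseline."""
    tn, fp, fn, tp = (
        int(v) for v in confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    )
    total = tn + fp + fn + tp
    return {
        "tn": tn, "fp": fp, "fn": fn, "tp": tp,
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
        # Accuracy obtained by always predicting the more frequent class
        "baseline_accuracy": max(tn + fp, fn + tp) / total,
    }


# Illustrative labels: counts chosen to match the reported metrics, not real predictions
y_true_demo = np.array([1] * 5 + [0] * 3 + [1] * 119 + [0] * 721)
y_pred_demo = np.array([1] * 5 + [1] * 3 + [0] * 119 + [0] * 721)

for name, value in summarize_confusion(y_true_demo, y_pred_demo).items():
    print(f"{name}: {value:.4f}" if isinstance(value, float) else f"{name}: {value}")
```

Comparing `accuracy` with `baseline_accuracy` shows how much of the headline accuracy comes from simply predicting the majority class.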


## Summary:

### Data Analysis Key Findings

*   The dataset `framingham.csv` contains 4238 entries and 16 columns.
*   Several columns (`education`, `cigsPerDay`, `BPMeds`, `totChol`, `BMI`, `heartRate`, `glucose`) have missing values, which were imputed with the mean of each respective column.
*   The data was split into training and testing sets with an 80/20 ratio.
*   A logistic regression model was trained on the training data. Training emitted a `ConvergenceWarning`: the solver reached its iteration limit before converging, which could be addressed by scaling the data or increasing `max_iter`.
*   The model's performance on the testing data was evaluated using several metrics:
    *   Accuracy: 0.8561
    *   Precision: 0.6250
    *   Recall: 0.0403
    *   F1 Score: 0.0758

### Insights or Next Steps

*   The low recall (0.0403) and F1 score (0.0758) show that the model identifies very few positive cases (`TenYearCHD` = 1), even though accuracy is 0.8561. High accuracy combined with near-zero recall is typical of an imbalanced target, where predicting the negative class for almost every row already scores well. The current features may also be insufficient to predict the target effectively.
*   Next steps: check the class balance of `TenYearCHD` and, if it is imbalanced, address it with techniques like oversampling or undersampling; scale the features to help convergence and possibly performance; and explore other models or feature engineering to improve the identification of positive cases.
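As one way to act on the imbalance point, the sketch below compares a plain logistic regression with one that uses `class_weight="balanced"`. Class weighting is swapped in here as a simple alternative to the resampling techniques named above. It runs on synthetic data from `make_classification`, so the numbers it prints say nothing about the Framingham results; the variable names and the roughly 85/15 class ratio are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced data (about 85% negatives), a stand-in for the real features
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=15, n_informative=5,
    weights=[0.85], class_sep=0.5, random_state=42,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Baseline: scaled features, default class weights
plain = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
plain.fit(X_tr, y_tr)

# Same pipeline, but errors on the minority class are penalized more heavily
weighted = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=1000, class_weight="balanced"),
)
weighted.fit(X_tr, y_tr)

recall_plain = recall_score(y_te, plain.predict(X_te))
recall_weighted = recall_score(y_te, weighted.predict(X_te))
print(f"Recall (default weights):  {recall_plain:.4f}")
print(f"Recall (balanced weights): {recall_weighted:.4f}")
```

Balanced weights usually raise recall at the cost of precision, so both should be checked before choosing between the two models.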
