# Autoformat notebook example

This notebook contains code that violates the PEP 8 style guidelines. It is designed to be used with the `black` autoformatting tool to demonstrate how the tool rewrites code for you.

**RECOMMENDED**: make a **copy** of this notebook before you auto-format it!  That way you can try different settings to see the results.
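
One way to make that copy from Python is sketched below. The helper `backup_notebook` and the filename `autoformat_example.ipynb` are hypothetical, since the notebook's real name is not stated here. Treat this as a minimal sketch, not a required step.

```python
import shutil
from pathlib import Path


def backup_notebook(path):
    """Copy a notebook next to the original, adding a '_copy' suffix."""
    source = Path(path)
    target = source.with_name(f"{source.stem}_copy{source.suffix}")
    shutil.copy2(source, target)  # copy2 also preserves file timestamps
    return target


# Hypothetical filename: replace it with the real name of this notebook.
notebook = Path("autoformat_example.ipynb")
if notebook.exists():
    print(f"Copied to {backup_notebook(notebook)}")
```

Formatting the copy and keeping the original untouched makes it easy to compare the two versions side by side.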

## Imports

In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

## 1. Violations of PEP 8 in the code example

* Line length: Many lines exceed the 79-character limit that PEP 8 recommends, and several exceed 100 characters.
* Inconsistent whitespace: The function signature has irregular spacing between parameters, with no space after some commas and several spaces after others.
* Whitespace-only lines: The lines that separate logical sections of the function contain trailing spaces instead of being truly blank.
* Overlong string literal: The long `print` call at the end should be broken across multiple lines. Black does not split long string literals by default, so this one has to be wrapped by hand.
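
Two of these violations, overlong lines and trailing whitespace, can be detected mechanically. The sketch below uses only the standard library. The helper `find_style_issues` is a hypothetical name introduced for illustration and is not part of `black` or any linter. It is a rough check under those assumptions, not a replacement for a real style tool.

```python
PEP8_MAX_LINE_LENGTH = 79  # the limit PEP 8 recommends


def find_style_issues(source, max_length=PEP8_MAX_LINE_LENGTH):
    """Return (line_number, message) pairs for two simple style checks."""
    issues = []
    for number, line in enumerate(source.splitlines(), start=1):
        if len(line) > max_length:
            message = f"line too long ({len(line)} > {max_length})"
            issues.append((number, message))
        if line != line.rstrip():
            issues.append((number, "trailing whitespace"))
    return issues


# A tiny sample with irregular spacing and trailing spaces on line 1.
sample = (
    "def f(a,b,    c):   \n"
    "    return a + b + c\n"
)
for number, message in find_style_issues(sample):
    print(f"line {number}: {message}")
# prints "line 1: trailing whitespace"
```

The irregular spacing between parameters is not flagged here; spotting it requires parsing the code, which is what an autoformatter does.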

In [2]:
def messy_data_analysis(data_frame,columns_to_process,    numeric_columns,categorical_columns,      target_variable):   
    # Preprocessing
    data_frame = data_frame[columns_to_process].copy()  # copy so the assignments below do not modify a view of the caller's frame
    data_frame[numeric_columns] = data_frame[numeric_columns].fillna(data_frame[numeric_columns].mean())
    data_frame[categorical_columns] = data_frame[categorical_columns].fillna(data_frame[categorical_columns].mode().iloc[0])
    
    # Feature engineering
    scaler = StandardScaler()
    data_frame[numeric_columns] = scaler.fit_transform(data_frame[numeric_columns])
    
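    # Requires scikit-learn >= 1.2: `sparse` was renamed to `sparse_output`, and `get_feature_names` was replaced by `get_feature_names_out`
    encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
    encoded_categorical = encoder.fit_transform(data_frame[categorical_columns])
    encoded_feature_names = encoder.get_feature_names_out(categorical_columns)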
    
    encoded_df = pd.DataFrame(encoded_categorical, columns=encoded_feature_names, index=data_frame.index)
    
    processed_data = pd.concat([data_frame[numeric_columns], encoded_df, data_frame[target_variable]], axis=1)
    
    # Split the data
    X = processed_data.drop(columns=[target_variable])
    y = processed_data[target_variable]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    
    # Train a simple model
    model = LogisticRegression(random_state=42, max_iter=1000)
    model.fit(X_train, y_train)
    
    # Evaluate the model
    train_accuracy = model.score(X_train, y_train)
    test_accuracy = model.score(X_test, y_test)
    
    print(f"This function performs data preprocessing, feature engineering, and trains a logistic regression model on the given dataset. The model achieves a training accuracy of {train_accuracy:.2f} and a test accuracy of {test_accuracy:.2f}. Please note that this is a basic analysis and may not be suitable for all datasets or problems. Further optimization and model selection might be necessary for better results.")
    
    return model, train_accuracy, test_accuracy
