Here are some cool and useful functions you can create for data analysis in Python:


## Automatically detect and handle missing data: 
Create a function that scans a DataFrame for missing values, and fills them in with appropriate values (e.g. mean, median, mode) or removes the rows/columns with too many missing values.

In [1]:
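from pandas.api.types import is_numeric_dtype

def handle_missing(df, max_missing=0.5):
    """Drop columns that are mostly missing, then fill remaining gaps with the median or mode."""
    # max_missing=0.5 is an arbitrary default; columns above that share are dropped.
    # .copy() keeps the caller's DataFrame untouched.
    df = df.loc[:, df.isnull().mean() <= max_missing].copy()
    for col in df.columns:
        if df[col].isnull().any():
            if is_numeric_dtype(df[col]):
                df[col] = df[col].fillna(df[col].median())
            else:
                # mode() ignores NaN; take the most frequent value
                df[col] = df[col].fillna(df[col].mode().iloc[0])
    return df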
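Before filling or dropping anything, it usually helps to see how much is missing per column. The following is a minimal sketch of such a scan; `missing_report` and the `sample` DataFrame are hypothetical names introduced here for illustration.

```python
import numpy as np
import pandas as pd

def missing_report(df):
    """Return per-column missing counts and fractions, worst column first."""
    report = pd.DataFrame({
        "n_missing": df.isnull().sum(),
        "frac_missing": df.isnull().mean(),
        "dtype": df.dtypes.astype(str),
    })
    return report.sort_values("frac_missing", ascending=False)

# Small made-up DataFrame to exercise the function
sample = pd.DataFrame({
    "age": [25, np.nan, 40, 31],
    "city": ["Oslo", None, None, None],
    "score": [1.0, 2.0, 3.0, 4.0],
})
report = missing_report(sample)
print(report)
```

The fractions in the report can guide the choice between imputing a column and dropping it.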

## Automated data visualization: 
Write a function that takes in a DataFrame and automatically generates visualizations based on data types and distributions. For example, draw a histogram for each numeric column and a bar chart of value counts for each categorical column; a correlation matrix across the numeric columns is a natural extension.



In [None]:
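import matplotlib.pyplot as plt
from pandas.api.types import is_numeric_dtype

def auto_visualize(df):
    """Plot each column on its own figure: histogram for numeric, bar chart for categorical."""
    for col in df.columns:
        if is_numeric_dtype(df[col]):
            fig, ax = plt.subplots()  # new figure so plots do not overlap
            df[col].hist(ax=ax)
        elif df[col].dtype == 'object' or df[col].dtype.name == 'category':
            fig, ax = plt.subplots()
            df[col].value_counts().plot(kind='bar', ax=ax)
        else:
            continue  # skip types with no obvious default plot (e.g. datetimes)
        ax.set_title(col)

    plt.show()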
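The correlation matrix mentioned above can be sketched with plain matplotlib. This is one possible approach, not the only one; `plot_correlation_matrix` and the `demo` data are hypothetical names used for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd

def plot_correlation_matrix(df):
    """Draw a heatmap of pairwise correlations between numeric columns."""
    # Non-numeric columns are excluded because corr() is undefined for them
    corr = df.select_dtypes(include="number").corr()
    fig, ax = plt.subplots()
    image = ax.imshow(corr.to_numpy(), cmap="coolwarm", vmin=-1, vmax=1)
    ax.set_xticks(range(len(corr.columns)))
    ax.set_xticklabels(corr.columns, rotation=90)
    ax.set_yticks(range(len(corr.columns)))
    ax.set_yticklabels(corr.columns)
    fig.colorbar(image, ax=ax)
    ax.set_title("Correlation matrix")
    return corr, fig

# Made-up data: y is exactly 2 * x, label is non-numeric and gets ignored
demo = pd.DataFrame({"x": [1, 2, 3, 4], "y": [2, 4, 6, 8], "label": list("abab")})
corr, fig = plot_correlation_matrix(demo)
```

Call `plt.show()` afterwards to display the figure; returning `corr` lets you inspect the numbers as well.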

## Feature engineering helper:
Make a function that takes in a DataFrame and allows you to easily create new features by applying operations on existing columns. For example: divide numeric columns, concatenate text columns, extract date/time information etc.

In [2]:
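def create_features(df):
    """Add example features; assumes 'age', 'first_name' and 'last_name' columns exist."""
    df = df.copy()  # avoid mutating the caller's DataFrame

    # Bucket age into two categories
    df['age_cat'] = df['age'].apply(lambda x: 'young' if x < 30 else 'old')
    # Concatenate the two text columns
    df['full_name'] = df['first_name'] + ' ' + df['last_name']

    return df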
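The other operations mentioned above, dividing numeric columns and extracting date parts, could be sketched as small reusable helpers. The function names, column names, and sample data below are hypothetical and only illustrate the idea.

```python
import pandas as pd

def add_date_parts(df, col):
    """Return a copy with year, month and weekday columns derived from `col`."""
    out = df.copy()
    dates = pd.to_datetime(out[col])
    out[f"{col}_year"] = dates.dt.year
    out[f"{col}_month"] = dates.dt.month
    out[f"{col}_weekday"] = dates.dt.dayofweek  # Monday = 0
    return out

def add_ratio(df, numerator, denominator, name):
    """Return a copy with numerator / denominator stored in column `name`."""
    out = df.copy()
    # Replace zeros with NaN so the division yields missing values instead of inf
    out[name] = out[numerator] / out[denominator].replace(0, float("nan"))
    return out

# Made-up order data
orders = pd.DataFrame({
    "order_date": ["2024-01-15", "2024-03-02"],
    "revenue": [100.0, 50.0],
    "units": [4, 0],
})
orders = add_ratio(add_date_parts(orders, "order_date"), "revenue", "units", "unit_price")
print(orders)
```

Because each helper returns a copy, they can be chained without side effects on the original DataFrame.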

## Customizable DataFrame summary: 
Create a function that prints key statistics on a DataFrame, with options to customize what's included - things like count, mean, std dev, min, max for numeric cols, value counts for categorical etc.
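A minimal sketch of such a summary function follows. The name `summarize`, its `stats` and `top_n` parameters, and the `people` data are assumptions made for this example.

```python
import pandas as pd

def summarize(df, stats=("count", "mean", "std", "min", "max"), top_n=3):
    """Print chosen statistics for numeric columns and top values for the rest."""
    numeric = df.select_dtypes(include="number")
    other = df.select_dtypes(exclude="number")

    # One row per requested statistic, one column per numeric column
    numeric_summary = numeric.agg(list(stats)) if not numeric.empty else pd.DataFrame()
    print(numeric_summary)

    # Most frequent values for each non-numeric column
    for col in other.columns:
        print(f"\nTop {top_n} values for {col}:")
        print(other[col].value_counts().head(top_n))

    return numeric_summary

# Made-up data
people = pd.DataFrame({
    "age": [20, 30, 40],
    "city": ["Oslo", "Oslo", "Bergen"],
})
summary = summarize(people, stats=("mean", "max"), top_n=2)
```

Passing a different `stats` tuple controls which rows appear, and the returned DataFrame can be reused in later cells.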



## Automated model training:
Make a function that automatically trains several models on a dataset to find the best-performing one. Have it split the data, instantiate and fit models such as Random Forest, SVM, and a neural network, and output metrics.

In [3]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
import pandas as pd

def auto_train_models(df, target):
    """
    Automatically train and evaluate SVM, Random Forest and Neural Net models.
    Output accuracy scores for comparison.
    """
    
    # Split data
    X = df.drop(target, axis=1) 
    y = df[target]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
    
    # Train models
    svm = SVC().fit(X_train, y_train)
    rf = RandomForestClassifier().fit(X_train, y_train) 
    nn = MLPClassifier().fit(X_train, y_train)
    
    # Evaluate models
    print("SVM Accuracy:", accuracy_score(y_test, svm.predict(X_test)))
    print("Random Forest Accuracy:", accuracy_score(y_test, rf.predict(X_test))) 
    print("Neural Network Accuracy:", accuracy_score(y_test, nn.predict(X_test)))

In [None]:
# to use
auto_train_models(df, 'target_col_name')
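SVMs are sensitive to feature scale, and a single train/test split can give noisy scores. One possible refinement, sketched here under those assumptions, scales inside a pipeline and uses cross-validation. `cross_validate_models` and the synthetic `X_demo`/`y_demo` data are hypothetical names for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def cross_validate_models(X, y, cv=5):
    """Return the mean cross-validated accuracy for each candidate model."""
    candidates = {
        # Scaling lives inside the pipeline so it is refit on every training fold
        "SVM": make_pipeline(StandardScaler(), SVC()),
        "Random Forest": RandomForestClassifier(random_state=42),
    }
    return {
        name: cross_val_score(model, X, y, cv=cv, scoring="accuracy").mean()
        for name, model in candidates.items()
    }

# Synthetic data stands in for a real feature matrix and target
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)
scores = cross_validate_models(X_demo, y_demo)
print(scores)
```

Keeping the scaler inside the pipeline avoids leaking statistics from the validation folds into training.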

## Model comparison function:
Create a function that takes in multiple trained models and a test set, runs predictions, and outputs a table or chart to easily compare performance across models.


In [None]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import (ConfusionMatrixDisplay, accuracy_score, f1_score,
                             precision_score, recall_score)

def compare_models(models, X_test, y_test, average='binary'):
    """
    Compare performance of multiple fitted ML models, print scores
    and plot confusion matrices.
    Pass average='macro' or 'weighted' for multiclass targets.
    """

    rows = []

    for model in models:

        # Make predictions on test set
        y_pred = model.predict(X_test)

        # Calculate and collect scores
        rows.append({'model': model.__class__.__name__,
                     'accuracy': accuracy_score(y_test, y_pred),
                     'precision': precision_score(y_test, y_pred, average=average),
                     'recall': recall_score(y_test, y_pred, average=average),
                     'f1': f1_score(y_test, y_pred, average=average)})

    # DataFrame.append was removed in pandas 2.0, so build the table in one step
    results = pd.DataFrame(rows, columns=['model', 'accuracy', 'precision', 'recall', 'f1'])

    # Print scores
    print(results)

    # Plot confusion matrices
    # (plot_confusion_matrix was removed in scikit-learn 1.2)
    for model in models:
        disp = ConfusionMatrixDisplay.from_estimator(model, X_test, y_test)
        disp.figure_.suptitle(model.__class__.__name__)
    plt.show()

    return results


The key is to automate repetitive tasks and build reusable functions that make your analysis & modeling workflows more efficient!

## Pipeline automation with Scikit-learn

**Example 1: Building a simple pipeline using Scikit-learn**

This example demonstrates a basic pipeline using Scikit-learn to perform data preprocessing, feature extraction, and classification.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# X_train / X_test are assumed to be lists of raw text documents,
# y_train the matching labels.
# TfidfVectorizer comes first because it consumes raw strings.
# StandardScaler is left out: it cannot process text, and centered
# (negative) features would be rejected by MultinomialNB.
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('classifier', MultinomialNB())
])

# Train the pipeline on the data
pipeline.fit(X_train, y_train)

# Make predictions on new data
y_pred = pipeline.predict(X_test)
```
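For a version that runs end to end, here is a small sketch on toy data using only the TF-IDF and Naive Bayes steps. The texts, labels, and the `text_clf` name are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Tiny made-up corpus: two spam and two ham messages
train_texts = [
    "win a free prize now",
    "free money click here",
    "meeting moved to monday",
    "lunch with the team tomorrow",
]
train_labels = ["spam", "spam", "ham", "ham"]

text_clf = Pipeline([
    ("tfidf", TfidfVectorizer()),     # raw strings -> sparse TF-IDF features
    ("classifier", MultinomialNB()),  # works on non-negative features
])
text_clf.fit(train_texts, train_labels)

predictions = text_clf.predict(["free prize money", "team meeting monday"])
print(predictions)
```

The pipeline object behaves like a single estimator, so the vectorizer is fitted on the training texts only and reused unchanged at prediction time.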

**Example 2: Automating hyperparameter tuning using GridSearchCV**

This example illustrates how to automate hyperparameter tuning using GridSearchCV within a pipeline.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import GridSearchCV

# Text pipeline: TfidfVectorizer consumes raw strings, so no StandardScaler
# is placed in front of it (X_train is assumed to hold raw text documents)
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('classifier', MultinomialNB())
])

# Define a parameter grid; keys use the '<step name>__<parameter>' convention
param_grid = {
    'tfidf__ngram_range': [(1, 1), (1, 2)],
    'classifier__alpha': [0.1, 0.5, 1.0]
}

# Perform GridSearchCV to find the optimal hyperparameters
grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

# Print the best parameters and the mean cross-validated score they achieved
print('Best parameters:', grid_search.best_params_)
print('Best score:', grid_search.best_score_)
```
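Beyond the single best result, the fitted search object keeps scores for every parameter combination in `cv_results_`. The sketch below shows one way to view them as a table; the toy corpus and variable names are made up, and `cv=3` is chosen only because the sample is tiny.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up corpus with six documents per class so 3-fold CV can stratify
spam = ["free prize now", "win money fast", "claim your free gift",
        "cheap loans today", "free cash offer", "win a free phone"]
ham = ["meeting at noon", "see you at lunch", "project update attached",
       "call me tomorrow", "notes from the meeting", "lunch on friday"]
texts = spam + ham
labels = [1] * len(spam) + [0] * len(ham)

search = GridSearchCV(
    Pipeline([("tfidf", TfidfVectorizer()), ("classifier", MultinomialNB())]),
    {"tfidf__ngram_range": [(1, 1), (1, 2)], "classifier__alpha": [0.1, 1.0]},
    cv=3,
    scoring="accuracy",
)
search.fit(texts, labels)

# cv_results_ holds one entry per parameter combination
table = pd.DataFrame(search.cv_results_)[["params", "mean_test_score", "rank_test_score"]]
print(table.sort_values("rank_test_score"))
```

Looking at the full table shows whether the best setting wins clearly or only by a margin that could be noise.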

**Example 3: Implementing a custom pipeline component**

This example demonstrates how to create a custom pipeline component for data preprocessing.

```python
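from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

class DataCleaner(BaseEstimator, TransformerMixin):
    """Placeholder cleaner: strips whitespace and lowercases each document."""

    def fit(self, X, y=None):
        # Nothing to learn here; return self so the pipeline can chain calls
        return self

    def transform(self, X):
        # Clean and transform the data (replace with your own cleaning logic)
        X_cleaned = [str(doc).strip().lower() for doc in X]
        return X_cleaned

# Create a pipeline object; X_train is assumed to hold raw text documents,
# so no StandardScaler is placed before the vectorizer
pipeline = Pipeline([
    ('data_cleaner', DataCleaner()),
    ('tfidf', TfidfVectorizer()),
    ('classifier', MultinomialNB())
])

# Train the pipeline on the data
pipeline.fit(X_train, y_train)

# Make predictions on new data
y_pred = pipeline.predict(X_test)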
```

These examples showcase the flexibility and power of pipeline automation for machine learning in Python. By combining various components and automating hyperparameter tuning, you can streamline the development and evaluation of machine learning models.