# Data Science Workflow: End-to-End Example

This notebook demonstrates a complete data science workflow including data loading, preprocessing, feature engineering, model training, evaluation, and result export.

## 1. Load and Explore Data

Load the dataset using pandas and perform initial exploration such as displaying the first few rows, checking for missing values, and summarizing statistics.

In [1]:
import pandas as pd

# Load dataset (replace 'data.csv' with your actual file)
df = pd.read_csv('data.csv')

# Display first few rows
df.head()

FileNotFoundError: [Errno 2] No such file or directory: 'data.csv'

In [2]:
# Check for missing values (printed explicitly, since a cell only displays its last expression)
print(df.isnull().sum())

# Summary statistics
df.describe()

NameError: name 'df' is not defined
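The cells above stop at `FileNotFoundError` because `data.csv` is not present, and every later cell then fails with `NameError`. A minimal sketch of a guarded loader follows. The helper `load_or_demo`, the fallback frame, and its column names are hypothetical and exist only so the exploration calls have something to run on.

```python
from pathlib import Path

import numpy as np
import pandas as pd


def load_or_demo(path="data.csv"):
    """Read `path` if it exists; otherwise return a small made-up frame."""
    if Path(path).exists():
        return pd.read_csv(path)
    # Hypothetical stand-in data, not the real dataset
    return pd.DataFrame({
        "feature1": [1.0, 2.0, np.nan, 4.0],
        "feature2": [10.0, 20.0, 30.0, 40.0],
        "category": ["a", "b", "a", None],
        "target": [0, 1, 0, 1],
    })


demo_df = load_or_demo("no_such_file.csv")

# Missing values per column, then summary statistics for numeric columns
missing_counts = demo_df.isnull().sum()
print(missing_counts)
print(demo_df.describe())
```

With a real file in place, `load_or_demo('data.csv')` would simply return the contents of that file.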

## 2. Preprocess Data

Clean the data by handling missing values, encoding categorical variables, and normalizing or scaling features as needed.

In [3]:
# Fill missing values (example: fill with mean for numeric columns)
# Assign the result back; fillna(inplace=True) on df[col] is chained assignment and may not update df
for col in df.select_dtypes(include='number').columns:
    df[col] = df[col].fillna(df[col].mean())

# Encode categorical variables as integer codes (missing values become -1)
for col in df.select_dtypes(include='object').columns:
    df[col] = df[col].astype('category').cat.codes

NameError: name 'df' is not defined

In [4]:
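from sklearn.preprocessing import StandardScaler

# Scale the feature columns only; scaling 'target' (assumed name, see Section 4)
# would turn class labels into continuous values that a classifier rejects
feature_cols = df.columns.drop('target', errors='ignore')
scaler = StandardScaler()
df[feature_cols] = scaler.fit_transform(df[feature_cols])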

NameError: name 'df' is not defined
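The three preprocessing steps can be checked on a tiny frame. This is a sketch on made-up data, and the column names `age` and `city` are hypothetical. It imputes with the mean, integer-encodes a text column, and scales the features while leaving the target labels as they are.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up data for illustration; the column names are hypothetical
demo = pd.DataFrame({
    "age": [20.0, np.nan, 40.0, 60.0],
    "city": ["x", "y", "x", "z"],
    "target": [0, 1, 0, 1],
})

# Mean-impute the numeric feature (the mean of 20, 40, 60 is 40)
demo["age"] = demo["age"].fillna(demo["age"].mean())

# Integer-encode the text column; categories are sorted, so x -> 0, y -> 1, z -> 2
demo["city"] = demo["city"].astype("category").cat.codes

# Scale the features only; the target keeps its original class labels
feature_cols = ["age", "city"]
scaled = pd.DataFrame(
    StandardScaler().fit_transform(demo[feature_cols]),
    columns=feature_cols,
    index=demo.index,
)
prepared = pd.concat([scaled, demo[["target"]]], axis=1)
print(prepared)
```

Integer codes imply an ordering between categories. Tree-based models usually tolerate that, while linear models may be better served by one-hot encoding.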

## 3. Feature Engineering

Create new features or transform existing ones to improve model performance, such as extracting date parts or combining columns.

In [5]:
# Example: Create a new feature as sum of two columns
if 'feature1' in df.columns and 'feature2' in df.columns:
    df['feature_sum'] = df['feature1'] + df['feature2']

NameError: name 'df' is not defined
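The text mentions extracting date parts, which the cell above does not show. A small sketch follows, assuming a hypothetical `order_date` column stored as ISO-formatted strings; the data is made up.

```python
import pandas as pd

# Made-up data; 'order_date' is a hypothetical column name
demo = pd.DataFrame({
    "order_date": ["2024-01-15", "2024-02-29", "2024-12-31"],
    "feature1": [1.0, 2.0, 3.0],
    "feature2": [10.0, 20.0, 30.0],
})

# Parse the strings once, then derive calendar features from the datetime values
dates = pd.to_datetime(demo["order_date"])
demo["order_month"] = dates.dt.month
demo["order_dayofweek"] = dates.dt.dayofweek  # Monday=0 ... Sunday=6

# Combine two numeric columns, as in the cell above
demo["feature_sum"] = demo["feature1"] + demo["feature2"]
print(demo)
```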

## 4. Model Selection and Training

Select appropriate machine learning models, split the data into training and test sets, and train the models using scikit-learn or other libraries.

In [6]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# Assume target column is named 'target'
X = df.drop('target', axis=1)
y = df['target']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestClassifier(random_state=42)  # fixed seed so repeated runs give the same model
model.fit(X_train, y_train)

NameError: name 'df' is not defined
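Since the cell above cannot run without `df`, here is a self-contained sketch of the same split-and-train step on synthetic data from `make_classification`. The `stratify` argument and the `n_estimators=50` setting are illustrative choices, not part of the original cell.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary classification data standing in for the real dataset
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

# Hold out 20% for testing; stratify keeps class proportions similar in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# Fixed seed so the fitted forest is reproducible
clf = RandomForestClassifier(n_estimators=50, random_state=42)
clf.fit(X_tr, y_tr)

print("Train shape:", X_tr.shape, "Test shape:", X_te.shape)
print("Feature importances:", clf.feature_importances_)
```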

## 5. Model Evaluation

Evaluate model performance using metrics suited to the task: accuracy, precision, recall, and F1-score for classification (RMSE applies to regression models instead), and visualize results with plots.

In [7]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
import matplotlib.pyplot as plt

# Predict and evaluate
y_pred = model.predict(X_test)

print('Accuracy:', accuracy_score(y_test, y_pred))
print('Precision:', precision_score(y_test, y_pred, average='weighted'))
print('Recall:', recall_score(y_test, y_pred, average='weighted'))
print('F1 Score:', f1_score(y_test, y_pred, average='weighted'))

# Plot feature importances
plt.figure(figsize=(10,6))
plt.bar(X.columns, model.feature_importances_)
plt.xticks(rotation=90)
plt.title('Feature Importances')
plt.show()

NameError: name 'model' is not defined
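To make the metrics concrete, the following sketch computes them on six hand-written labels, so each value can be checked by counting. The label lists are made up for illustration.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)

# Made-up labels: 4 of 6 predictions are correct
y_true_demo = [0, 0, 1, 1, 1, 0]
y_pred_demo = [0, 1, 1, 1, 0, 0]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true_demo, y_pred_demo)
acc = accuracy_score(y_true_demo, y_pred_demo)
prec = precision_score(y_true_demo, y_pred_demo, average="weighted")
rec = recall_score(y_true_demo, y_pred_demo, average="weighted")

print(cm)
print("Accuracy:", acc)
print("Precision:", prec)
print("Recall:", rec)
```

Each class here has two correct predictions out of three, so accuracy, weighted precision, and weighted recall all come out to 2/3.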

## 6. Save and Export Results

Save the trained model and export predictions or evaluation results to files for further use.

In [8]:
import joblib

# Save trained model
joblib.dump(model, 'trained_model.pkl')

# Export predictions
pd.DataFrame({'y_true': y_test, 'y_pred': y_pred}).to_csv('predictions.csv', index=False)

NameError: name 'model' is not defined
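A saved model is only useful if it loads back and predicts the same way. The sketch below round-trips a small classifier through `joblib` in a temporary directory, using synthetic data. The temporary path is an illustrative choice, since the cell above writes to the working directory.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data and a small model standing in for the trained one
X_demo, y_demo = make_classification(n_samples=100, n_features=4, random_state=0)
clf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_demo, y_demo)

# Save to a temporary directory and load the model back
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = str(Path(tmp_dir) / "trained_model.pkl")
    joblib.dump(clf, model_path)
    restored = joblib.load(model_path)

# The restored model should reproduce the original predictions exactly
same_predictions = np.array_equal(clf.predict(X_demo), restored.predict(X_demo))
print("Predictions match after reload:", same_predictions)
```

Pickled models should only be loaded from trusted sources, and ideally with the same scikit-learn version that produced them.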