### Machine Learning for Data Quality Prediction
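**Description**: Train a classifier to predict whether a dataset has a data quality issue, using summary features such as missing-value rate, outlier count, data age, and number of inconsistent records.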

**Steps**:
1. Create a mock dataset with features and a binary label (`quality_issue`: 0 = good, 1 = issue).
2. Train a machine learning model.
3. Evaluate the model performance.
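
Step 3 is easier to interpret against a trivial baseline. The sketch below is one hedged way to do that, not part of the original notebook: it rebuilds a mock dataset with the same labeling rule used in this section (`missing_value_rate > 0.4` or `inconsistent_records > 2`), then compares the random forest with a majority-class `DummyClassifier` and reports feature importances. The helper names `make_mock_quality_data` and `compare_with_baseline` are hypothetical, and the use of `np.random.default_rng` and a stratified split are assumptions, so the numbers will not match the cell output exactly.

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def make_mock_quality_data(n_rows=1000, seed=42):
    """Hypothetical helper: mock features plus the rule-based label."""
    # Assumption: Generator API rather than np.random.seed, so the
    # sampled values differ from the ones printed in this section.
    rng = np.random.default_rng(seed)
    df = pd.DataFrame({
        "missing_value_rate": rng.random(n_rows),
        "num_outliers": rng.poisson(lam=2, size=n_rows),
        "avg_data_age": rng.integers(1, 365, size=n_rows),
        "inconsistent_records": rng.integers(0, 5, size=n_rows),
    })
    # Same deterministic rule as the section's code.
    df["quality_issue"] = (
        (df["missing_value_rate"] > 0.4) | (df["inconsistent_records"] > 2)
    ).astype(int)
    return df


def compare_with_baseline(df, seed=42):
    """Hypothetical helper: score a random forest against a majority-class baseline."""
    X = df.drop(columns="quality_issue")
    y = df["quality_issue"]
    # Assumption: stratify keeps the class ratio equal in train and test.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed, stratify=y
    )
    baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
    forest = RandomForestClassifier(random_state=seed).fit(X_train, y_train)
    return {
        "positive_rate": float(y.mean()),
        "baseline_accuracy": accuracy_score(y_test, baseline.predict(X_test)),
        "forest_accuracy": accuracy_score(y_test, forest.predict(X_test)),
        "feature_importances": dict(
            zip(X.columns, (float(v) for v in forest.feature_importances_))
        ),
    }


scores = compare_with_baseline(make_mock_quality_data())
print(scores)
```

Because the features are sampled independently, the rule labels roughly 1 - 0.4 × 0.6 = 76% of rows as issues, so a model that always predicts "issue" already scores well on accuracy. The gap between the forest and that baseline, together with the importances concentrating on the two rule features, is a more honest reading of the result than accuracy alone.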

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# Step 1: Create mock dataset
np.random.seed(42)
data_size = 1000

# Features (example): missing_value_rate, num_outliers, avg_data_age, inconsistent_records
df = pd.DataFrame({
    "missing_value_rate": np.random.rand(data_size),  # 0 to 1
    "num_outliers": np.random.poisson(lam=2, size=data_size),
    "avg_data_age": np.random.randint(1, 365, size=data_size),  # days
    "inconsistent_records": np.random.randint(0, 5, size=data_size)
})

# Label: 0 (good quality), 1 (quality issue)
# Deterministic rule for this mock data: flag an issue when
# missing_value_rate > 0.4 or inconsistent_records > 2
df["quality_issue"] = ((df["missing_value_rate"] > 0.4) | (df["inconsistent_records"] > 2)).astype(int)

print("Sample data:\n", df.head())

# Step 2: Train ML model
X = df.drop("quality_issue", axis=1)
y = df["quality_issue"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

# Step 3: Evaluate model
y_pred = model.predict(X_test)

print("\nModel Performance:")
print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))

Sample data:
   missing_value_rate  num_outliers  avg_data_age  inconsistent_records  quality_issue
0            0.374540             1           343                     3              1
1            0.950714             6           126                     3              1
2            0.731994             1            75                     4              1
3            0.598658             2           165                     1              1
4            0.156019             2           278                     2              0

Model Performance:
Accuracy: 0.995

Classification Report:
               precision    recall  f1-score   support

           0       1.00      0.98      0.99        54
           1       0.99      1.00      1.00       146

    accuracy                           0.99       200
   macro avg       1.00      0.99      0.99       200
weighted avg       1.00      0.99      0.99       200


Confusion Matrix:
 [[ 53   1]
 [