PREDICTIVE ANALYSIS

Option 1: Use Scikit-Learn's Built-in Datasets
Scikit-learn ships with small built-in datasets, such as iris and digits, which are useful for practicing classification. Both of those are multiclass (3 and 10 classes, respectively), so for a binary classification example, let’s use the make_classification function to generate synthetic data instead.

Example: Generate Synthetic Data for Binary Classification

In [1]:
from sklearn.datasets import make_classification
import pandas as pd

# Generate synthetic data
X, y = make_classification(n_samples=1000, n_features=10, n_classes=2, random_state=42)

# Convert to DataFrame for a better view
df = pd.DataFrame(X, columns=[f'feature{i+1}' for i in range(X.shape[1])])
df['target'] = y

# Display the first few rows
print(df.head())


   feature1  feature2  feature3  feature4  feature5  feature6  feature7  \
0  0.964799 -0.066449  0.986768 -0.358079  0.997266  1.181890 -1.615679   
1 -0.916511 -0.566395 -1.008614  0.831617 -1.176962  1.820544  1.752375   
2 -0.109484 -0.432774 -0.457649  0.793818 -0.268646 -1.836360  1.239086   
3  1.750412  2.023606  1.688159  0.006800 -1.607661  0.184741 -2.619427   
4 -0.224726 -0.711303 -0.220778  0.117124  1.536061  0.597538  0.348645   

   feature8  feature9  feature10  target  
0 -1.210161 -0.628077   1.227274       0  
1 -0.984534  0.363896   0.209470       1  
2 -0.246383 -1.058145  -0.297376       1  
3 -0.357445 -1.473127  -0.190039       0  
4 -0.939156  0.175915   0.236224       1  


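This creates a dataset with:

1000 samples
10 features named feature1 to feature10
Binary target variable labeled target, with classes 0 and 1

With the default settings of make_classification, only 2 of the 10 features are informative and 2 more are redundant (linear combinations of the informative ones); the remaining 6 are random noise.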
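
make_classification also accepts parameters that control how hard the problem is. The following is a minimal sketch, assuming you want more informative features and an imbalanced target; the specific values for n_informative, n_redundant, and weights are illustrative choices, not settings from the example above.

```python
import numpy as np
from sklearn.datasets import make_classification

# Illustrative settings (assumptions, not from the original example):
# 4 informative features, 2 redundant ones, and a roughly 80/20 class split.
X_imb, y_imb = make_classification(
    n_samples=1000,
    n_features=10,
    n_informative=4,
    n_redundant=2,
    n_classes=2,
    weights=[0.8, 0.2],
    random_state=42,
)

# Count how many samples fall into each class
classes, counts = np.unique(y_imb, return_counts=True)
class_counts = dict(zip(classes.tolist(), counts.tolist()))
print(class_counts)
```

The split is only approximately 80/20, because make_classification flips a small fraction of labels at random by default (the flip_y parameter).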

-------------------

Option 3: Use Real-World Data from Scikit-Learn

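Scikit-learn offers some popular real-world datasets that can be loaded directly, such as the Iris, Wine, and Breast Cancer datasets. (The Boston housing dataset, load_boston, was removed in scikit-learn 1.2 and is no longer available.)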

In [2]:
from sklearn.datasets import load_iris
import pandas as pd

# Load the Iris dataset
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['target'] = iris.target

# Display the first few rows
print(df.head())


   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  \
0                5.1               3.5                1.4               0.2   
1                4.9               3.0                1.4               0.2   
2                4.7               3.2                1.3               0.2   
3                4.6               3.1                1.5               0.2   
4                5.0               3.6                1.4               0.2   

   target  
0       0  
1       0  
2       0  
3       0  
4       0  


Step 1: Load the Data

In [3]:
from sklearn.datasets import load_iris
import pandas as pd

# Load the Iris dataset
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['target'] = iris.target

# Display the first few rows of the data
print(df.head())


   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  \
0                5.1               3.5                1.4               0.2   
1                4.9               3.0                1.4               0.2   
2                4.7               3.2                1.3               0.2   
3                4.6               3.1                1.5               0.2   
4                5.0               3.6                1.4               0.2   

   target  
0       0  
1       0  
2       0  
3       0  
4       0  


Step 2: Split the Data into Training and Test Sets

We’ll split the dataset into a training set (80%) and a test set (20%) for model evaluation.

In [4]:
from sklearn.model_selection import train_test_split

# Separate features and target
X = df[iris.feature_names]
y = df['target']

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Training samples: {X_train.shape[0]}, Test samples: {X_test.shape[0]}")


Training samples: 120, Test samples: 30
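
With only 30 test samples, a purely random split can leave the classes unevenly represented in the test set. A minimal sketch, assuming you want the class proportions preserved, passes the stratify argument to train_test_split. This is an optional variation rather than what the walkthrough uses, and the variable names ending in _s are chosen here only to avoid overwriting the split above.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()

# stratify=iris.target keeps the class proportions the same in both subsets
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    iris.data, iris.target, test_size=0.2, stratify=iris.target, random_state=42
)

# Iris has 50 samples per class, so each class gets an equal share of the test set
test_counts = np.bincount(y_test_s)
print("Test samples per class:", test_counts)
```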


Step 3: Train a Model

For this example, we’ll use a Random Forest Classifier, an ensemble of decision trees that usually performs well with its default settings and does not require feature scaling.

In [5]:
from sklearn.ensemble import RandomForestClassifier

# Initialize the model
model = RandomForestClassifier(random_state=42)

# Train the model on the training data
model.fit(X_train, y_train)
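
Once a random forest is fitted, it exposes a feature_importances_ attribute that indicates how much each feature contributed to the splits in its trees. The sketch below refits the same model so that it runs on its own, and it assumes the impurity-based importances are informative enough for a quick look; they can be biased toward some feature types, so treat the ranking as a rough guide.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)

model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

# Impurity-based importances, one value per feature, normalized to sum to 1
importances = pd.Series(
    model.feature_importances_, index=iris.feature_names
).sort_values(ascending=False)
print(importances)
```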


Step 4: Make Predictions

With the trained model, we can now make predictions on the test data.

In [6]:
# Predict on the test data
y_pred = model.predict(X_test)

# Display predictions
print("Predictions:", y_pred)


Predictions: [1 0 2 1 1 0 1 2 1 1 2 0 0 0 0 1 2 1 1 2 0 2 0 2 2 2 2 2 0 0]
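
The predictions are integer class codes. As a minimal sketch, the code below maps them back to species names with iris.target_names and also calls predict_proba to show how confident the forest is in each prediction. It refits the model so that it runs on its own.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)
model = RandomForestClassifier(random_state=42).fit(X_train, y_train)

# Integer class codes, as in the step above
y_pred = model.predict(X_test)

# Index the array of class names with the codes to get readable labels
predicted_names = iris.target_names[y_pred]

# One row per test sample, one column per class; each row sums to 1
proba = model.predict_proba(X_test)

print(predicted_names[:5])
print(proba[:5].round(2))
```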


Step 5: Evaluate the Model

Let’s evaluate the model’s accuracy using the test set.

In [7]:
from sklearn.metrics import accuracy_score, classification_report

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy * 100:.2f}%")

# Print a detailed classification report
print(classification_report(y_test, y_pred, target_names=iris.target_names))


Accuracy: 100.00%
              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        10
  versicolor       1.00      1.00      1.00         9
   virginica       1.00      1.00      1.00        11

    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30
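
A confusion matrix complements the report by showing which classes get mistaken for which. The following is a minimal sketch that refits the same model so it runs on its own; wrapping the matrix in a DataFrame is just one convenient way to label the rows and columns.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)
model = RandomForestClassifier(random_state=42).fit(X_train, y_train)
y_pred = model.predict(X_test)

# Rows are actual classes, columns are predicted classes
cm = confusion_matrix(y_test, y_pred)
cm_df = pd.DataFrame(cm, index=iris.target_names, columns=iris.target_names)
cm_df.index.name = "actual"
cm_df.columns.name = "predicted"
print(cm_df)
```

Correct predictions sit on the diagonal; any off-diagonal count is a misclassification.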



Full Code Example
Here's the complete code for reference:

In [8]:
from sklearn.datasets import load_iris
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

# Load the Iris dataset
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['target'] = iris.target

# Separate features and target
X = df[iris.feature_names]
y = df['target']

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train the model
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

# Predict on the test data
y_pred = model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy * 100:.2f}%")
print(classification_report(y_test, y_pred, target_names=iris.target_names))


Accuracy: 100.00%
              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        10
  versicolor       1.00      1.00      1.00         9
   virginica       1.00      1.00      1.00        11

    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30



Explanation of the Output

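Accuracy: The fraction of test samples the model classified correctly (here, all 30 of 30).

Classification Report: Lists precision, recall, F1-score, and support for each class. Precision is the share of predictions for a class that were correct, recall is the share of a class’s true members that the model found, F1-score is the harmonic mean of precision and recall, and support is the number of test samples that belong to the class.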
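
To make these metrics concrete, the sketch below computes them by hand for one class and checks the results against scikit-learn. The labels are hypothetical toy values chosen only to illustrate the arithmetic; they are not output from the model above.

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical toy labels (not from the Iris model)
y_true = np.array([0, 0, 1, 1, 1, 2, 2, 2])
y_hat = np.array([0, 1, 1, 1, 2, 2, 2, 0])

cls = 1  # compute the metrics for class 1

tp = np.sum((y_hat == cls) & (y_true == cls))  # predicted 1, actually 1
fp = np.sum((y_hat == cls) & (y_true != cls))  # predicted 1, actually another class
fn = np.sum((y_hat != cls) & (y_true == cls))  # actually 1, predicted another class

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
support = np.sum(y_true == cls)

# The same quantities from scikit-learn, restricted to class 1
sk_precision = precision_score(y_true, y_hat, labels=[cls], average=None)[0]
sk_recall = recall_score(y_true, y_hat, labels=[cls], average=None)[0]
sk_f1 = f1_score(y_true, y_hat, labels=[cls], average=None)[0]

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} support={support}")
```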

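This basic model achieves high accuracy on the Iris dataset, which is balanced (50 samples per class) and relatively easy to separate. Keep in mind that a perfect score on a 30-sample test set is a noisy estimate, so it may not carry over to other splits or to new data. You can experiment with different models (like LogisticRegression, KNeighborsClassifier, or SVC) or tune the RandomForestClassifier parameters to see how performance changes.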
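
One way to compare models more reliably than with a single 80/20 split is k-fold cross-validation. The following is a minimal sketch, assuming 5 folds and default hyperparameters; the StandardScaler pipelines and max_iter=1000 are choices made here because the scale-sensitive models tend to behave better with them, not settings from the walkthrough above.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

iris = load_iris()
X, y = iris.data, iris.target

# Scale-sensitive models are wrapped in a pipeline so that scaling is
# fitted on the training folds only
candidates = {
    "RandomForestClassifier": RandomForestClassifier(random_state=42),
    "LogisticRegression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "KNeighborsClassifier": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "SVC": make_pipeline(StandardScaler(), SVC()),
}

# Mean accuracy over 5 folds for each candidate
results = {}
for name, estimator in candidates.items():
    scores = cross_val_score(estimator, X, y, cv=5)
    results[name] = scores.mean()
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

On a dataset this small, differences of a percentage point or two between models are usually within the fold-to-fold variation.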