
#4: DATA PREPARATION PHASE TO MODEL THE DATA

**1. How to Partition the Data Using R/Python**

In [1]:
# Partitioning data into training and test sets using scikit-learn

import pandas as pd
from sklearn.model_selection import train_test_split

# Sample data
data = {'Feature1': [2.5, 3.6, 4.8, 5.1, 6.3],
        'Feature2': [1.5, 2.4, 3.6, 4.1, 5.0],
        'Target': [0, 1, 1, 0, 1]}

df = pd.DataFrame(data)

# Split the data into training and test sets (80% train, 20% test)
X = df[['Feature1', 'Feature2']]
y = df['Target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("Training Data:\n", X_train)
print("Test Data:\n", X_test)


Training Data:
    Feature1  Feature2
4       6.3       5.0
2       4.8       3.6
0       2.5       1.5
3       5.1       4.1
Test Data:
    Feature1  Feature2
1       3.6       2.4
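
The split above yields four training rows and one test row because 20% of five rows is a single row. A minimal sketch of the shuffle-and-slice idea behind such a partition follows, in plain Python. `shuffle_split` is a hypothetical helper, not scikit-learn's implementation, and the row order it produces will differ from `train_test_split`; rounding the test count up is an assumption of this sketch.

```python
import math
import random

def shuffle_split(rows, test_size=0.2, seed=42):
    """Shuffle rows with a fixed seed, then slice off the test portion."""
    rng = random.Random(seed)
    shuffled = list(rows)            # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    # Assumption of this sketch: round the test count up to a whole row
    n_test = math.ceil(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]

rows = list(range(5))                # stand-in for five row indices
train, test = shuffle_split(rows)
print(len(train), len(test))
```

Fixing the seed makes the partition repeatable, which plays the same role as `random_state=42` in the cell above.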


**2. How to Balance the Training Data Set Using R/Python**

In [2]:
# Balancing the training data set using RandomOverSampler from imbalanced-learn

import pandas as pd
from imblearn.over_sampling import RandomOverSampler

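# Sample imbalanced data, treated here as the training partition
# (resample only the training data; leave the test set untouched)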
data = {'Feature1': [1, 2, 3, 4, 5, 6],
        'Feature2': [10, 20, 30, 40, 50, 60],
        'Target': [0, 0, 0, 0, 1, 1]}

df = pd.DataFrame(data)

# Split features and target
X = df[['Feature1', 'Feature2']]
y = df['Target']

# Apply random oversampling
ros = RandomOverSampler(random_state=42)
X_resampled, y_resampled = ros.fit_resample(X, y)

# Convert back to DataFrame for visualization
resampled_df = pd.DataFrame(X_resampled, columns=['Feature1', 'Feature2'])
resampled_df['Target'] = y_resampled

print("Balanced Data Set:\n", resampled_df)


Balanced Data Set:
    Feature1  Feature2  Target
0         1        10       0
1         2        20       0
2         3        30       0
3         4        40       0
4         5        50       1
5         6        60       1
6         5        50       1
7         6        60       1
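
In the output above, the two minority-class rows appear twice, so both classes end up with four rows. The following is a minimal plain-Python sketch of that idea, assuming the rows passed in are the training partition only. `random_oversample` is a hypothetical helper written for illustration and is not the imbalanced-learn implementation.

```python
import random
from collections import Counter

def random_oversample(rows, labels, seed=42):
    """Duplicate randomly chosen minority-class rows until every class
    has as many rows as the largest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        # All original rows belonging to this class
        pool = [r for r, l in zip(rows, labels) if l == cls]
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))   # sample with replacement
            out_labels.append(cls)
    return out_rows, out_labels

# Hypothetical training partition; test rows would be held out beforehand
train_rows = [(1, 10), (2, 20), (3, 30), (4, 40), (5, 50), (6, 60)]
train_labels = [0, 0, 0, 0, 1, 1]
bal_rows, bal_labels = random_oversample(train_rows, train_labels)
print(Counter(bal_labels))
```

Oversampling after the split keeps duplicated rows out of the test set, so the evaluation is not inflated by copies of training examples.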


**3. How to Build CART Decision Trees Using R/Python**

In [3]:
# Building a CART decision tree using scikit-learn

from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

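# Build a CART decision tree (scikit-learn uses an optimized version of CART);
# fix random_state so the result is reproducible
clf = DecisionTreeClassifier(random_state=42)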
clf.fit(X_train, y_train)

# Predict and evaluate the model
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print("CART Decision Tree Accuracy:", accuracy)


CART Decision Tree Accuracy: 1.0
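
For classification, CART chooses each split by minimizing an impurity measure, and Gini impurity is the default criterion in `DecisionTreeClassifier`. The sketch below shows that calculation for a single numeric feature in plain Python. The helper names and the six sample values are hypothetical and chosen for illustration; this is not how scikit-learn implements the search internally.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def split_gini(values, labels, threshold):
    """Weighted Gini impurity of the two groups produced by a threshold."""
    left = [l for v, l in zip(values, labels) if v <= threshold]
    right = [l for v, l in zip(values, labels) if v > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

def best_threshold(values, labels):
    """Try midpoints between consecutive distinct values; keep the purest."""
    distinct = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    return min(candidates, key=lambda t: split_gini(values, labels, t))

# Hypothetical feature values for two well-separated classes
values = [1.4, 1.3, 1.5, 4.7, 4.5, 4.9]
labels = [0, 0, 0, 1, 1, 1]
t = best_threshold(values, labels)
print("Impurity before split:", gini(labels))
print("Best threshold:", t, "weighted impurity:", split_gini(values, labels, t))
```

A full tree repeats this search over every feature at every node until a stopping rule is met.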


**5. How to Build Random Forests Using R/Python**

In [4]:
# Building a random forest classifier using scikit-learn

from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Build a random forest classifier with 100 trees;
# fix random_state so the result is reproducible
rf_clf = RandomForestClassifier(n_estimators=100, random_state=42)
rf_clf.fit(X_train, y_train)

# Predict and evaluate the model
y_pred = rf_clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print("Random Forest Accuracy:", accuracy)


Random Forest Accuracy: 1.0
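
A random forest trains each tree on a bootstrap sample of the training rows and then combines the trees' predictions. The sketch below illustrates those two ingredients in plain Python, with hypothetical helpers and made-up per-tree predictions. Note that it uses hard majority voting for simplicity, whereas scikit-learn's `RandomForestClassifier` averages the trees' predicted class probabilities.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement, as each tree would see them."""
    return [rng.choice(rows) for _ in rows]

def majority_vote(predictions):
    """Return the class predicted by the most trees."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(42)
rows = list(range(10))                       # stand-in for training row indices
samples = [bootstrap_sample(rows, rng) for _ in range(3)]

# Hypothetical predictions from three trees for one test row
votes = [1, 0, 1]
print("Sample sizes:", [len(s) for s in samples])
print("Forest prediction:", majority_vote(votes))
```

Because each sample is drawn with replacement, some rows repeat and others are left out, which is what makes the individual trees differ from one another.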
