
### Manual implementation

In [1]:
import numpy as np


def scratch_train_test_split(X, y, train_size=0.8, random_state=0):
    """Split X and y into random train and test subsets."""
    np.random.seed(random_state)  # Seed the global RNG so the split is reproducible
    y = y.reshape(-1, 1)  # Make y a column so it can be joined to X
    Xy = np.concatenate([X, y], axis=1)
    size = len(Xy)
    pick = int(np.round(size * train_size))  # Number of training rows
    # Sample training row indices without replacement; the remaining rows form the test set
    train_pick = np.random.choice(np.arange(size), pick, replace=False)
    test_pick = np.delete(np.arange(size), train_pick)
    train = Xy[train_pick, :]
    test = Xy[test_pick, :]
    # Separate the joined array into features (all but the last column) and target (last column)
    X_train = train[:, :-1]
    y_train = train[:, -1]
    X_test = test[:, :-1]
    y_test = test[:, -1]
    return X_train, X_test, y_train, y_test

Once the implementation is complete, we check that it works in our local environment. We will use the iris data set here. This kind of check is also common in practice, and it is an important way to find errors early.

In [2]:
import numpy as np
from sklearn.datasets import load_iris
data = load_iris()
X = data.data
y = data.target
X_train, X_test, y_train, y_test = scratch_train_test_split(X, y)
print('X', X.shape)
print('y', y.shape)
print('X_train', X_train.shape)
print('X_test', X_test.shape)
print('y_train', y_train.shape)
print('y_test', y_test.shape)

X (150, 4)
y (150,)
X_train (120, 4)
X_test (30, 4)
y_train (120,)
y_test (30,)


If you get the expected output, the scratch implementation is working as intended. You can also confirm that passing the same data to scikit-learn's train_test_split produces output of the same shapes. Comparing against a trusted reference in this way is a useful habit for checking the validity of results.
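The comparison described above can be sketched as follows. This is a minimal illustration, assuming scikit-learn is installed; the `ref_` variable names and the `expected_shapes` dictionary are introduced here for illustration and simply restate the shapes printed by the scratch implementation. Only the shapes are compared, because the two functions shuffle rows differently and will generally select different samples.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

data = load_iris()
X, y = data.data, data.target

# Reference split from scikit-learn with the same proportions as the scratch call.
ref_X_train, ref_X_test, ref_y_train, ref_y_test = train_test_split(
    X, y, train_size=0.8, random_state=0
)

# Shapes printed by the scratch implementation in the output above.
expected_shapes = {
    "X_train": (120, 4),
    "X_test": (30, 4),
    "y_train": (120,),
    "y_test": (30,),
}

reference_shapes = {
    "X_train": ref_X_train.shape,
    "X_test": ref_X_test.shape,
    "y_train": ref_y_train.shape,
    "y_test": ref_y_test.shape,
}

# True when both implementations return arrays of the same shapes.
shapes_match = reference_shapes == expected_shapes
print(shapes_match)
```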

### Creating a base model for classification problems

In [3]:
# obtaining iris data and creating a dataset
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris

# Load dataset
data = load_iris().data
target = load_iris().target.reshape(-1, 1)
iris = np.concatenate([data, target], axis=1)

# Create DataFrame
iris = pd.DataFrame(iris)

# Create data for binary classification
iris_X = iris.loc[iris[4] != 0, 0:3].values
iris_y = iris.loc[iris[4] != 0, 4].values

# Split dataset
X_train, X_test, y_train, y_test = scratch_train_test_split(iris_X, iris_y, train_size=0.8, random_state=0)

In [6]:
# creating a classifier
from sklearn.linear_model import SGDClassifier

# Create a logistic regression model
clf = SGDClassifier(loss="log_loss")
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

In [5]:
from sklearn.svm import SVC

# Create an SVM model
clf = SVC(gamma='auto')
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

In [7]:
from sklearn.tree import DecisionTreeClassifier

# Create a decision tree model
clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)


To evaluate the performance of a classifier, we use four metrics: accuracy, precision, recall, and the F1 score. Measuring model performance takes extra work, but it is essential for judging a model accurately.

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# evaluation (default average='binary' works only for two-class labels; multiclass targets raise a ValueError)
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

In [9]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# evaluation
accuracy = accuracy_score(y_test, y_pred)
# Use 'weighted' for multiclass classification
precision = precision_score(y_test, y_pred, average='weighted')
recall = recall_score(y_test, y_pred, average='weighted')
f1 = f1_score(y_test, y_pred, average='weighted')
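To make the four classification metrics concrete, here is a small sketch that computes them by hand from the counts of true and false positives and negatives. The labels `y_true` and `y_hat` are made up for illustration and are not taken from the iris experiment; the `_manual` names are likewise hypothetical.

```python
import numpy as np

# Hypothetical binary labels, invented for this illustration only.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_hat = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

# Count the four outcomes, treating label 1 as the positive class.
tp = int(np.sum((y_true == 1) & (y_hat == 1)))  # true positives
fp = int(np.sum((y_true == 0) & (y_hat == 1)))  # false positives
fn = int(np.sum((y_true == 1) & (y_hat == 0)))  # false negatives
tn = int(np.sum((y_true == 0) & (y_hat == 0)))  # true negatives

accuracy_manual = (tp + tn) / len(y_true)  # share of all predictions that are correct
precision_manual = tp / (tp + fp)          # share of predicted positives that are correct
recall_manual = tp / (tp + fn)             # share of actual positives that are found
# F1 is the harmonic mean of precision and recall.
f1_manual = 2 * precision_manual * recall_manual / (precision_manual + recall_manual)

print(accuracy_manual, precision_manual, recall_manual, f1_manual)
```

With `average='weighted'`, scikit-learn computes precision, recall, and F1 per class in this way and then averages them, weighting each class by its number of true samples.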

### Regression

In [11]:
# data creation
import pandas as pd # Load pandas
train = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/train.csv') # Load CSV
X = train[['GrLivArea', 'YearBuilt']].values # Select explanatory variables
y = train[['SalePrice']].values # Select the target variable
X_train, X_test, y_train, y_test = scratch_train_test_split(X, y, train_size=0.8, random_state=0) # Split data

In [12]:
# preprocessing (standardization)
from sklearn.preprocessing import StandardScaler # Load the class for standardization
scaler = StandardScaler() # Instantiate the class
scaler.fit(X_train) # Compute the mean and standard deviation from the training data only
X_train_std = scaler.transform(X_train) # Standardize the training data
X_test_std = scaler.transform(X_test) # Standardize the test data
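The standardization step above can be sketched by hand with NumPy. This is an illustration only: the small arrays `X_tr` and `X_te` are made-up values standing in for living area and year built, not rows from train.csv. The point it shows is that the mean and standard deviation come from the training data alone and are then reused for the test data.

```python
import numpy as np

# Hypothetical feature rows (area, year), invented for illustration.
X_tr = np.array([[1000.0, 1990.0], [1500.0, 2000.0], [2000.0, 2010.0]])
X_te = np.array([[1500.0, 2005.0]])

# Statistics are computed per column from the training data only.
mean = X_tr.mean(axis=0)
std = X_tr.std(axis=0)  # population standard deviation (ddof=0), as StandardScaler uses

# The same training statistics are applied to both sets.
X_tr_std = (X_tr - mean) / std
X_te_std = (X_te - mean) / std

print(X_tr_std.mean(axis=0), X_tr_std.std(axis=0))
print(X_te_std)
```

Reusing the training statistics keeps information from the test set out of the preprocessing, so the test score remains an honest estimate.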


In [13]:
# training the model
from sklearn.linear_model import SGDRegressor # Load linear regression model (stochastic gradient descent)
reg = SGDRegressor() # Instantiate the class
reg.fit(X_train_std, y_train) # Train the model
y_pred = reg.predict(X_test_std) # Run prediction

In [14]:
# evaluating the model
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np
mse = mean_squared_error(y_test, y_pred) # Calculate MSE
rmse = np.sqrt(mse) # Calculate RMSE
r2 = r2_score(y_test, y_pred) # Calculate R2 score
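The three regression metrics above can also be computed by hand, which makes their definitions explicit. The sketch below uses made-up values `y_obs` and `y_fit` rather than house price predictions; the `_manual` names are hypothetical.

```python
import numpy as np

# Hypothetical observed and predicted values, invented for illustration.
y_obs = np.array([3.0, 5.0, 7.0, 9.0])
y_fit = np.array([2.0, 5.0, 8.0, 9.0])

residuals = y_obs - y_fit

# MSE: mean of the squared residuals.
mse_manual = np.mean(residuals ** 2)
# RMSE: square root of MSE, in the same units as the target.
rmse_manual = np.sqrt(mse_manual)

# R2: one minus the ratio of residual variation to total variation around the mean.
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_obs - y_obs.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(mse_manual, rmse_manual, r2_manual)
```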