
# Classification

In supervised learning, we prepare a dataset that contains both features and a target variable (labels).
The goal is to build a model (estimator) that predicts each object's label from its features.

If the target variable is nominal (discrete), the task is treated as a classification problem.

Examples:

Predict the species of an iris from its measurements.

Given a polychromatic image of an object through a telescope, determine whether the object is a star, a quasar, or a galaxy.

## Problem 2 Creating code to solve the classification problem

### Acquisition of the iris data and creation of the dataset

Get the iris data and split it into training data and test data.

Generate a model (estimator) that solves the classification problem.

Predict on the test data and evaluate the predictions against the correct values.

In [3]:
import numpy as np

def scratch_train_test_split(X, y, train_size=0.8, random_state=0):
    """Split X and y into random train and test subsets."""
    np.random.seed(random_state)
    y = y.reshape(-1, 1)
    # Join features and labels so each row is sampled as one unit.
    Xy = np.concatenate([X, y], axis=1)
    size = len(Xy)
    n_train = int(np.round(size * train_size))
    # Sample training row indices without replacement; the rest are test rows.
    train_pick = np.random.choice(np.arange(size), n_train, replace=False)
    test_pick = np.delete(np.arange(size), train_pick)
    train = Xy[train_pick, :]
    test = Xy[test_pick, :]
    # The last column holds the labels; everything before it is features.
    X_train = train[:, :-1]
    y_train = train[:, -1]
    X_test = test[:, :-1]
    y_test = test[:, -1]
    return X_train, X_test, y_train, y_test
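
For comparison, scikit-learn ships its own `train_test_split`. The sketch below runs it on a small made-up array (`X_demo` and `y_demo` are illustrative names, not part of this notebook) to show the shapes an 80/20 split produces. The `stratify` argument is an extra that the scratch function above does not implement.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data, made up for illustration: 10 samples, 2 features, binary labels.
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

# 80% of the rows go to training, the remaining 20% to testing.
# stratify keeps the class ratio the same in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.8, random_state=0, stratify=y_demo
)

print(X_tr.shape, X_te.shape)  # (8, 2) (2, 2)
print(y_tr.shape, y_te.shape)  # (8,) (2,)
```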

In [4]:
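import numpy as np
import pandas as pd
from sklearn.datasets import load_iris

# Load the dataset once and stack features and labels into one table.
iris_dataset = load_iris()
data = iris_dataset.data
target = iris_dataset.target.reshape(-1, 1)
iris = pd.DataFrame(np.concatenate([data, target], axis=1))

# Keep only classes 1 and 2 (drop class 0) so the problem is binary.
# Columns 0-3 are the features; column 4 is the label.
iris_X = iris.loc[iris[4] != 0, 0:3].values
iris_y = iris.loc[iris[4] != 0, 4].values
X = iris_X
y = iris_y
X_train, X_test, y_train, y_test = scratch_train_test_split(X, y, train_size=0.8, random_state=0)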
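
The same two-class subset can also be selected with a NumPy boolean mask, without going through a DataFrame. This is a minimal sketch of that alternative, not the approach used in the cell above; the names `mask`, `X_bin` and `y_bin` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris

iris_data = load_iris()
mask = iris_data.target != 0    # True for every row that is not class 0 (setosa)
X_bin = iris_data.data[mask]    # all four feature columns of the kept rows
y_bin = iris_data.target[mask]  # remaining labels: 1 and 2

print(X_bin.shape)              # (100, 4)
print(np.unique(y_bin))         # [1 2]
```

Checking the shape and the unique labels after filtering is a quick way to confirm that the problem really is binary before fitting a model.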

### Generate various models (estimators) that solve classification problems

Logistic regression

In [7]:
from sklearn.linear_model import SGDClassifier
clf = SGDClassifier(loss="log_loss")  # this loss was named "log" in scikit-learn releases before 1.1
clf.fit(X_train,y_train)
y_pred = clf.predict(X_test)
y_pred

array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 2., 2., 2., 2., 2., 1., 1.,
       2., 1., 2.])

SVM

In [8]:
from sklearn.svm import SVC
clf = SVC(gamma='auto')
clf.fit(X_train,y_train)
y_pred = clf.predict(X_test)
y_pred

array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 2., 2., 2., 2., 2., 2., 2.,
       2., 2., 2.])

Decision tree

In [10]:
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_train,y_train)
y_pred = clf.predict(X_test)
y_pred

array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 2., 2., 2., 2., 2., 2., 1.,
       2., 1., 2.])
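
The three estimators share the same `fit` / `predict` interface, so they can be compared in a loop. The sketch below is an assumption-laden illustration rather than the notebook's own code: it uses scikit-learn's `train_test_split` instead of the scratch function, and it swaps in `LogisticRegression` for `SGDClassifier` so the result does not depend on the random initialisation of SGD.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Two-class iris subset (classes 1 and 2), split 80/20.
iris_data = load_iris()
mask = iris_data.target != 0
X_tr, X_te, y_tr, y_te = train_test_split(
    iris_data.data[mask], iris_data.target[mask], train_size=0.8, random_state=0
)

# Every estimator exposes the same fit/predict methods.
models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "SVM": SVC(gamma="auto"),
    "decision tree": DecisionTreeClassifier(random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    scores[name] = accuracy_score(y_te, model.predict(X_te))
    print(f"{name}: {scores[name]:.2f}")
```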

### Evaluation

In [11]:
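from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score

# y_pred holds the predictions of the most recently fitted model (the decision tree).
# precision, recall and F1 use the default pos_label=1, so class 1 is the positive class.
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)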
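
To make the four metrics concrete, the following sketch computes them by hand from the confusion counts on a small made-up example and checks the results against scikit-learn. The arrays `y_true` and `y_hat` are invented for illustration; class 1 is taken as the positive class, matching the default `pos_label=1`.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Made-up labels: four samples of class 1 followed by four of class 2.
y_true = np.array([1, 1, 1, 1, 2, 2, 2, 2])
y_hat = np.array([1, 1, 1, 2, 1, 1, 2, 2])

# Confusion counts with class 1 as the positive class.
tp = np.sum((y_true == 1) & (y_hat == 1))  # 3 true positives
fp = np.sum((y_true == 2) & (y_hat == 1))  # 2 false positives
fn = np.sum((y_true == 1) & (y_hat == 2))  # 1 false negative
tn = np.sum((y_true == 2) & (y_hat == 2))  # 2 true negatives

acc_manual = (tp + tn) / len(y_true)   # 5 / 8 = 0.625
prec_manual = tp / (tp + fp)           # 3 / 5 = 0.6
rec_manual = tp / (tp + fn)            # 3 / 4 = 0.75
f1_manual = 2 * prec_manual * rec_manual / (prec_manual + rec_manual)  # about 0.667

# The hand-computed values agree with scikit-learn.
print(np.isclose(acc_manual, accuracy_score(y_true, y_hat)))    # True
print(np.isclose(prec_manual, precision_score(y_true, y_hat)))  # True
print(np.isclose(rec_manual, recall_score(y_true, y_hat)))      # True
print(np.isclose(f1_manual, f1_score(y_true, y_hat)))           # True
```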

# Regression

If the target variable is continuous, the task is treated as a regression problem.

Example:

Determine the selling price of a house given a set of attributes.

## Problem 3 Creating code to solve the regression problem

In [12]:
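# train.csv must contain the columns GrLivArea, YearBuilt and SalePrice.
train = pd.read_csv('train.csv')
X = train[['GrLivArea','YearBuilt']].values  # two explanatory variables
y = train[['SalePrice']].values              # target variable
X_train, X_test, y_train, y_test = scratch_train_test_split(X,y,train_size=0.8,random_state=0)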

In [13]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
# Fit the scaler on the training data only, then apply the same transform to both sets.
scaler.fit(X_train)
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)
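
As a sketch of what the scaler does, the example below standardizes a made-up test row by hand, using the mean and standard deviation of a made-up training set, and compares the result with `StandardScaler`. The numbers are invented for illustration and are not taken from `train.csv`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up values for two features (for example, a living area and a year built).
X_tr = np.array([[1000.0, 1950.0], [1500.0, 1980.0], [2000.0, 2010.0]])
X_te = np.array([[1200.0, 1990.0]])

scaler_demo = StandardScaler().fit(X_tr)

# Standardization: subtract the training mean and divide by the training std.
mu = X_tr.mean(axis=0)
sigma = X_tr.std(axis=0)  # population std (ddof=0), which StandardScaler uses
manual = (X_te - mu) / sigma

print(np.allclose(manual, scaler_demo.transform(X_te)))  # True
```

The test row is scaled with the training statistics, not its own, so no information from the test set leaks into preprocessing.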

In [18]:
from sklearn.linear_model import SGDRegressor
reg = SGDRegressor()
reg.fit(X_train_std, y_train)
y_pred = reg.predict(X_test_std)
y_pred

array([292825.46462064, 144055.06919054, 135711.83259842, 227021.28812361,
       174876.05942282, 192701.21250746, 186338.93841133, 166296.8673638 ,
       151456.23860476, 131572.13941114, 235145.05821013, 211503.41532057,
       235124.91093085, 228700.08936923, 123549.20672072, 202691.32523642,
       221184.65103938, 193849.31467144, 144821.37280114, 190655.09704179,
       234033.59228674,  64554.00590459, 141797.85666906,  47931.73387586,
       203779.93737668, 205362.612358  , 111375.15395146, 174818.57402707,
       160631.98459096, 124626.39196943, 236492.47960043, 149109.61557874,
       231970.63692183,  79776.78415495, 128969.76457687, 237336.06494813,
       144046.34880282, 197827.02674887, 235145.05821013, 209939.23392851,
       209902.59768792, 250340.62248351, 212688.15372514, 220107.46579068,
       275984.07702424, 234488.6633213 , 165633.7567152 , 153377.00920117,
       131119.42394248, 134540.87665131, 252737.31326964, 208514.17986318,
       284351.82108952, 1

In [20]:
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
mse = mean_squared_error(y_test,y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test,y_pred)
print(mse)
print(rmse)
print(r2)

1717249853.4697752
41439.71348199424
0.6513483565229188
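
The three printed values are the mean squared error, its square root, and the coefficient of determination. A minimal sketch on made-up numbers (`y_true_demo` and `y_pred_demo` are illustrative names) shows how each is computed and that the hand calculation matches scikit-learn.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up targets and predictions.
y_true_demo = np.array([3.0, 5.0, 7.0, 9.0])
y_pred_demo = np.array([2.0, 5.0, 8.0, 10.0])

residuals = y_true_demo - y_pred_demo      # [ 1.  0. -1. -1.]
mse_manual = np.mean(residuals ** 2)       # (1 + 0 + 1 + 1) / 4 = 0.75
rmse_manual = np.sqrt(mse_manual)          # about 0.866, in the target's units

# R^2 = 1 - SS_res / SS_tot, with SS_tot measured around the mean of y_true.
ss_res = np.sum(residuals ** 2)                               # 3.0
ss_tot = np.sum((y_true_demo - y_true_demo.mean()) ** 2)      # 20.0
r2_manual = 1 - ss_res / ss_tot                               # 0.85

print(np.isclose(mse_manual, mean_squared_error(y_true_demo, y_pred_demo)))  # True
print(np.isclose(r2_manual, r2_score(y_true_demo, y_pred_demo)))             # True
```

Because RMSE is expressed in the same units as the target, it is usually easier to interpret than MSE when the target is a price.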
