### Machine Learning Process

In [1]:
import pandas as pd
from sklearn.linear_model import LogisticRegression
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

StandardScaler - standardises each feature by subtracting its mean and dividing by its standard deviation, so that every feature has $\mu = 0$ and $\sigma^2 = 1$.
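
To make the definition concrete, here is a minimal sketch comparing a manual standardisation with `StandardScaler`. The array `X_demo` is made-up illustration data, not part of the iris dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up feature matrix (4 samples, 2 features), for illustration only
X_demo = np.array([[1.0, 10.0],
                   [2.0, 20.0],
                   [3.0, 30.0],
                   [4.0, 40.0]])

# Manual standardisation: subtract the column mean, divide by the column
# standard deviation (population std, ddof=0, as StandardScaler uses)
manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

# The same operation through scikit-learn
scaled = StandardScaler().fit_transform(X_demo)

# True when both approaches agree up to floating-point error
print(np.allclose(manual, scaled))
```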

train_test_split - splits the dataset into a training set and a test set.

accuracy_score - computes the accuracy of a classification model, i.e. the ratio of correct predictions to the total number of predictions. It is an important part of validation.
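
As a small illustration of that ratio, the sketch below computes accuracy by hand and with `accuracy_score`. The label lists `y_true_demo` and `y_pred_demo` are hypothetical and are not predictions from any model in this notebook.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical true labels and predictions (4 of the 5 match)
y_true_demo = ["setosa", "setosa", "versicolor", "virginica", "virginica"]
y_pred_demo = ["setosa", "versicolor", "versicolor", "virginica", "virginica"]

# Manual accuracy: fraction of positions where prediction equals truth
manual_accuracy = np.mean(np.array(y_true_demo) == np.array(y_pred_demo))

# The same quantity through scikit-learn
sk_accuracy = accuracy_score(y_true_demo, y_pred_demo)

print(manual_accuracy, sk_accuracy)
```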


In [3]:
iris_data = pd.read_csv('iris.csv')
# print(iris_data) --> this will print the whole dataset
iris_data.head()

| | Id | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm | Species |
|---|---|---|---|---|---|---|
| 0 | 1 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 2 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 3 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |


In [7]:
X = iris_data.drop(columns = ['Id','Species'])  #features
y = iris_data['Species']        #labels

In [8]:
#split the data into training and testing sets.

X_train, X_test, y_train, y_test = train_test_split(X, y , test_size = 0.2, random_state = 42)

1. **test_size = 0.2** - 20% of the data is held out for testing and the remaining 80% is used for training.

2. **random_state = 42** - a seed value that makes the split reproducible. If the code above is run multiple times with the same random state, you will always get the same split. It acts like a "unique ID tag" for how the data is split.
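
Both parameters can be checked on a toy example. This is a minimal sketch: `X_demo` and `y_demo` are made-up arrays of 10 samples, chosen only so the 80/20 sizes are easy to see.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 10 samples with 2 features each, plus 10 labels
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# Two independent calls with the same seed
a_train, a_test, _, _ = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
b_train, b_test, _, _ = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# test_size=0.2 on 10 samples gives 8 training rows and 2 test rows
print(len(a_train), len(a_test))

# The same random_state yields the same rows in the test set
print(np.array_equal(a_test, b_test))
```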

In [9]:
#standardise the features

scaler = StandardScaler()  #creating an instance of standardscaler
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
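
Note that the cell above calls `fit_transform` on the training data but only `transform` on the test data. The sketch below, using made-up one-column arrays, illustrates what that implies: the test data is scaled with the mean and standard deviation learned from the training data, not with its own statistics.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up single-feature data, for illustration only
train_demo = np.array([[1.0], [2.0], [3.0]])
test_demo = np.array([[4.0]])

# fit() learns the mean and scale from the training data only
scaler_demo = StandardScaler().fit(train_demo)

# transform() reuses those training statistics on the test data
test_scaled = scaler_demo.transform(test_demo)

# Manual equivalent: scale the test row with the training mean and std
manual = (test_demo - train_demo.mean(axis=0)) / train_demo.std(axis=0)

print(scaler_demo.mean_, scaler_demo.scale_)
print(np.allclose(test_scaled, manual))
```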

In [10]:
print(X_train_scaled)

[[-1.47393679  1.22037928 -1.5639872  -1.30948358]
 [-0.13307079  3.02001693 -1.27728011 -1.04292204]
 [ 1.08589829  0.09560575  0.38562104  0.28988568]
 [-1.23014297  0.77046987 -1.21993869 -1.30948358]
 [-1.7177306   0.32056046 -1.39196294 -1.30948358]
 [ 0.59831066 -1.25412249  0.72966956  0.95628954]
 [ 0.72020757  0.32056046  0.44296246  0.42316645]
 [-0.74255534  0.99542457 -1.27728011 -1.30948358]
 [-0.98634915  1.22037928 -1.33462153 -1.30948358]
 [-0.74255534  2.34515281 -1.27728011 -1.44276436]
 [-0.01117388 -0.80421307  0.78701097  0.95628954]
 [ 0.23261993  0.77046987  0.44296246  0.55644722]
 [ 1.08589829  0.09560575  0.5576453   0.42316645]
 [-0.49876152  1.8952434  -1.39196294 -1.04292204]
 [-0.49876152  1.44533399 -1.27728011 -1.30948358]
 [-0.37686461 -1.47907719 -0.01576889 -0.24323741]
 [ 0.59831066 -0.57925837  0.78701097  0.42316645]
 [ 0.72020757  0.09560575  1.01637665  0.82300877]
 [ 0.96400139 -0.12934896  0.38562104  0.28988568]
 [ 1.69538284  1.22037928  1.36

In [12]:
column_means = np.mean(X_train_scaled, axis=0)
column_variance = np.var(X_train_scaled, axis =0)
print("Mean of each column:", column_means)
print("Variance of each column:", column_variance)

Mean of each column: [6.51330841e-16 7.54951657e-16 1.92438658e-16 1.44328993e-16]
Variance of each column: [1. 1. 1. 1.]
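
The printed means are of order 1e-16 rather than exactly 0, which is floating-point rounding error. A tolerance-based comparison such as `np.allclose` is one way to confirm they are effectively zero. In this sketch the values are copied from the output above so that it runs on its own.

```python
import numpy as np

# Column means copied from the printed output above
column_means = np.array([6.51330841e-16, 7.54951657e-16,
                         1.92438658e-16, 1.44328993e-16])

# allclose compares within a small tolerance instead of exact equality
print(np.allclose(column_means, 0.0))
```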
