# ML Analysis

## Imports

- sklearn
- numpy

In [2]:
import sklearn
import numpy as np

## Reading Data into Datasets

To store and use our data efficiently, and because `sklearn` operates on NumPy arrays, we will load the data into NumPy arrays rather than plain Python lists. This is easily done with NumPy's `loadtxt` function.
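As a minimal sketch of how `np.loadtxt` selects columns, the example below parses a small in-memory CSV. The CSV contents and the names `csv_text`, `X_demo`, and `y_demo` are made up for illustration; the layout (an id column, then features, then a label) is only an assumption modeled on the batch files used in this notebook.

```python
import io
import numpy as np

# Hypothetical in-memory CSV standing in for a batch file:
# column 0 is an id, columns 1-2 are features, column 3 is the label.
csv_text = "7,1200.5,1310.0,1\n8,1185.0,1250.25,0\n"

# usecols skips the id column; ndmin=2 keeps the result 2-D even for one row.
data = np.loadtxt(io.StringIO(csv_text), delimiter=',', usecols=(1, 2, 3), ndmin=2)

# Split into a feature matrix and a label vector.
X_demo, y_demo = data[:, :-1], data[:, -1]

print(X_demo.shape, y_demo.shape)
print(X_demo.dtype)
```

Reading the file once and slicing the result avoids parsing the same file twice, which matters more as the batch files grow.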

Let's create a utility function so we can easily grab $X$ and $y$.

In [20]:
BATCH_DIRNAME = 'batches'

def get_batch(batch: str) -> tuple[np.ndarray, np.ndarray]:
    # Read the file once: columns 1-8 are the features, column 9 is the label.
    # A forward slash in the path works on Windows, macOS, and Linux.
    data = np.loadtxt(f'{BATCH_DIRNAME}/{batch}.csv', delimiter=',',
                      usecols=tuple(range(1, 10)), ndmin=2)
    X, y = data[:, :8], data[:, 8]
    return X, y

In [21]:
X, y = get_batch('all')
print('first line:')
print('features:', X[0], 'label:', y[0])
print('\nlast line:')
print('features:', X[-1], 'label:', y[-1])
print('\nfourth last line:')
print('features:', X[-4], 'label:', y[-4])

first line:
features: [1210.40492639 1213.02863786 1171.05709319 1252.43329795 1311.78709141
 1134.0426267  1224.20189759 1174.29643889] label: 0.0

last line:
features: [1224.53449101 1229.01082112 1192.25245912 1155.44636828 1251.97486078
 1309.55508442 1161.8690622  1198.93541254] label: 0.0

fourth last line:
features: [1247.5890092  1148.12047806 1339.2792588  1325.48466976 1441.76044124
 1219.12120178 1328.84213757 1195.64984241] label: 1.0


## Logistic Regression

The first algorithm we will use to analyze the data is logistic regression.
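For a binary problem, logistic regression computes a linear score $z = w \cdot x + b$ and maps it to a probability with the sigmoid function $1 / (1 + e^{-z})$. The sketch below checks that relationship against `predict_proba` on synthetic data. The data and the names `X_demo`, `y_demo`, and `demo_clf` are illustrative assumptions, not the batches used in this notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (an assumption; not the notebook's batch files).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 8))
noise = rng.normal(scale=0.5, size=200)
y_demo = (X_demo[:, 0] - X_demo[:, 4] + noise > 0).astype(float)

demo_clf = LogisticRegression().fit(X_demo, y_demo)

# Linear score z = w . x + b, followed by the sigmoid.
z = X_demo @ demo_clf.coef_.ravel() + demo_clf.intercept_[0]
p_manual = 1.0 / (1.0 + np.exp(-z))

# Column 1 of predict_proba is the probability of the second class (label 1.0).
p_sklearn = demo_clf.predict_proba(X_demo)[:, 1]
print(np.allclose(p_manual, p_sklearn))

# predict() returns label 1.0 when z > 0, i.e. when that probability is above 0.5.
print(np.array_equal(demo_clf.predict(X_demo), (z > 0).astype(float)))
```

The two columns returned by `predict_proba` follow the order of `clf.classes_`, so each row sums to 1.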

In [22]:
from sklearn.linear_model import LogisticRegression
X, y = get_batch('all')
clf = LogisticRegression().fit(X, y)

In [33]:
print(clf.predict(X[:5, :]))
print(clf.predict_proba(X[:5, :]))

print()

print(clf.predict(X[-5:, :]))
print(clf.predict_proba(X[-5:, :]))

print()
print('score:', clf.score(X,y))

[1. 0. 1. 0. 1.]
[[0.48386419 0.51613581]
 [0.63631611 0.36368389]
 [0.27485614 0.72514386]
 [0.5875905  0.4124095 ]
 [0.3587525  0.6412475 ]]

[1. 1. 0. 1. 1.]
[[0.243994   0.756006  ]
 [0.32961014 0.67038986]
 [0.87589573 0.12410427]
 [0.37776647 0.62223353]
 [0.35138978 0.64861022]]

score: 0.7194719471947195
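
The score above is measured on the same rows the model was fit on, so it may overstate how well the model handles unseen data. One common way to get a held-out estimate is `train_test_split`; the sketch below shows this on synthetic data. The data, the 25% test fraction, and the names `X_demo`, `y_demo`, and `holdout_clf` are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; not the notebook's batch files).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(400, 8))
signal = X_demo[:, :4].sum(axis=1) - X_demo[:, 4:].sum(axis=1)
y_demo = (signal + rng.normal(size=400) > 0).astype(float)

# Hold out 25% of the rows; stratify keeps the class balance similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo)

holdout_clf = LogisticRegression().fit(X_train, y_train)

train_acc = holdout_clf.score(X_train, y_train)
test_acc = holdout_clf.score(X_test, y_test)
print('train accuracy:', train_acc)
print('test accuracy: ', test_acc)
```

To try this on the real data, `X_demo` and `y_demo` could be replaced with the `X` and `y` returned by `get_batch('all')`.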


In [38]:
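X1, y1 = get_batch('season_1')
clf1 = LogisticRegression().fit(X1, y1)  # model fit on season 1 only

# Score the model trained on all batches (clf) against the season 1 data.
print('score:', clf.score(X1, y1))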

score: 0.7519704433497537


In [48]:
print(clf.predict_proba([[1500,1500,1500,1500,1500,1500,1500,1500],
                         [1100,1300,1700,1900,1500,1500,1500,1500],
                         [1500,1500,1500,1500,1100,1300,1700,1900],
                         [1400,1400,1800,1400,1500,1500,1500,1500],
                         [1500,1500,1500,1500,1400,1400,1800,1400]]))

[[0.47894058 0.52105942]
 [0.43918388 0.56081612]
 [0.45823215 0.54176785]
 [0.49712757 0.50287243]
 [0.42695033 0.57304967]]
