## Project 01_2023

**Lance Cole**

**DSCI 35600 - Machine Learning**

## Part A: Import Packages and Load Dataset

In the cell below, import the following packages using the standard aliases: `numpy`, `matplotlib.pyplot`, and `pandas`. Also import the following classes and functions from `sklearn`: `train_test_split`,  `LinearRegression`, `PolynomialFeatures`,  `LogisticRegression`, `StandardScaler`, and `OneHotEncoder`.  

In [9]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler

Use `pandas` to load the contents of the tab-separated file `Project01_data.txt` into a dataframe called `df`. Display the first 10 rows of this dataframe.

In [10]:
df = pd.read_csv('Project01_data.txt', sep='\t')
df.head(10)

Unnamed: 0,F1,F2,F3,F4,F5,y
0,15.69,-0.771,550.880459,P,D,0
1,-16.81,1.959,588.523801,Q,C,3
2,21.09,-1.55,660.881834,P,B,2
3,15.64,-1.623,374.414543,Q,C,0
4,14.25,1.426,446.71412,Q,B,0
5,21.54,1.231,525.126448,P,D,2
6,-14.05,1.608,343.26432,P,B,3
7,-21.52,-1.858,549.753447,Q,B,1
8,12.31,-0.941,507.148376,Q,D,0
9,-24.83,-1.94,627.0401,Q,B,1


Your goal in this assignment will be to use features F1 - F5 to predict one of four possible values for y: 0, 1, 2, or 3. 

## Part B: Preparing the Data

Using Lecture 6, in the cell below, create the following arrays:

* `X_num` should contain the columns of `df` associated with numerical variables. 
* `X_cat` should contain the columns of `df` associated with categorical variables. 
* `y` should be a 1D array containing the values of the label, `y`. 

Print the first 3 rows of each of these three arrays.

In [11]:
X_num = df.iloc[:,:3].values
X_cat = df.iloc[:,[3,4]].values
# .values converts the label column from a pandas Series to a 1D NumPy array
y = df.iloc[:,-1].values

print(X_num.shape)
print(X_cat.shape)
print(y.shape)

(467, 3)
(467, 2)
(467,)


#### Numerical Features
Split `X_num` into training and validation sets called `X_num_train` and `X_num_val`. Use an 80/20 split, and set `random_state=1`. 

Then use the `StandardScaler` class to scale the numerical data. Name the resulting arrays `X_sca_train` and `X_sca_val`. Print the shape of these two arrays. 
Print the top 5 rows of `X_sca_train`.

In [25]:
X_num_train, X_num_val, y_num_train, y_num_val = train_test_split(X_num, y, test_size=0.2, random_state=1)
s_scaler = StandardScaler()

# Fit the scaler on the training rows only, then apply the same mean and std to the validation rows
X_sca_train = s_scaler.fit_transform(X_num_train)
X_sca_val = s_scaler.transform(X_num_val)

print(X_sca_train.shape)
print(X_sca_val.shape)

print("X_sca_train = ")
print(X_sca_train)

(373, 3)
(94, 3)
X_sca_train = 
[[-1.34790978 -0.85473365 -0.03279194]
 [-0.70984609 -0.4799912   0.9331761 ]
 [-0.98459949  1.88147176  2.06401676]
 ...
 [ 0.95173081 -0.89923432 -0.40795084]
 [ 0.7468011   0.98560309 -0.96206722]
 [-0.15636557  1.0558673  -0.28997942]]
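The scaling step can be illustrated on its own. The sketch below uses randomly generated stand-in data (the names `X_demo`, `y_demo`, and the `_tr`/`_va` suffixes are hypothetical, not part of the assignment) to show the usual pattern: the scaler learns its mean and standard deviation from the training rows and reuses them on the validation rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for X_num and y; the real arrays come from df.
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=[0.0, 0.0, 500.0], scale=[20.0, 1.5, 100.0], size=(50, 3))
y_demo = rng.integers(0, 4, size=50)

X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1
)

scaler = StandardScaler()
X_tr_sca = scaler.fit_transform(X_tr)  # learn mean/std from the training rows
X_va_sca = scaler.transform(X_va)      # reuse the training mean/std

print(X_tr_sca.shape, X_va_sca.shape)
print(X_tr_sca[:5])                    # first 5 scaled training rows
```

Only the training columns are guaranteed to have mean 0 and standard deviation 1 after this step; the validation columns will be close to that but not exact, which is expected.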


#### Categorical Features

Use the `OneHotEncoder` class to encode the categorical feature array (setting `sparse=False`). Store the results in an array called `X_enc`. 

Split `X_enc` into training and validation sets called `X_enc_train` and `X_enc_val`. Use an 80/20 split, and set `random_state=1`. Print the shapes of these two arrays.

In [26]:
enc = OneHotEncoder(sparse=False)
X_enc = enc.fit_transform(X_cat)

X_enc_train, X_enc_val, y_enc_train, y_enc_val = train_test_split(X_enc, y, test_size=0.2, random_state=1)

print(X_enc_train.shape)
print(X_enc_val.shape)


(373, 6)
(94, 6)
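A note on the encoder call: in recent scikit-learn releases (1.2 and later, to my knowledge) the `sparse` keyword was renamed to `sparse_output`, and the old name was later removed. The sketch below avoids both keywords by converting the default sparse result with `.toarray()`. The array `X_cat_demo` is a hypothetical stand-in built from category values that appear in the first rows of `df`.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical stand-in for X_cat, using values seen in df.head(10).
X_cat_demo = np.array(
    [['P', 'D'], ['Q', 'C'], ['P', 'B'], ['Q', 'B']], dtype=object
)

enc_demo = OneHotEncoder()  # default output is a SciPy sparse matrix
X_enc_demo = enc_demo.fit_transform(X_cat_demo).toarray()

print(enc_demo.categories_)  # one array of levels per input column
print(X_enc_demo.shape)      # one output column per level
```

Each encoded row contains exactly one 1 per original categorical column, so the number of output columns equals the total number of distinct levels across the columns.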


#### Combine Numerical and Categorical Features

Use `np.hstack()` to combine `X_sca_train` and `X_enc_train` into an array called `X_train`. Then combine `X_sca_val` and `X_enc_val` into an array called `X_val`. Print the shapes of the two new arrays.

In [27]:
X_train = np.hstack([X_sca_train, X_enc_train])
X_val = np.hstack([X_sca_val, X_enc_val])
print(X_val.shape)
print(X_train.shape)

(94, 9)
(373, 9)


## Part C: Polynomial Regression Model

Using Lecture 8, in the cell below create polynomial models with degrees 1, 3, 4, 5, and 7. Fit each model and compute its training and validation scores. Do not plot anything.


In [36]:
# Note: this cell generates synthetic 1-D data and overwrites y, X_train, and X_val from Part B
degree = [1, 3, 5, 7, 9, 11]
tr_score = []
va_score = []

np.random.seed(1)
n = 40
x = np.random.uniform(-4, 4, n)
X = x.reshape(-1,1)
y =  0.3 + 0.05 * x + 0.001 * x**7 + np.random.normal(0, 2, n)

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.4, random_state=1)

for i in range(len(degree)):
    
    # Build the polynomial terms on the training data, then apply the same expansion to the validation data
    poly = PolynomialFeatures(degree[i])
    Xp_train = poly.fit_transform(X_train)
    Xp_val = poly.transform(X_val)
    
    mod = LinearRegression()
    mod.fit(Xp_train, y_train)
    
    # score() returns the r2 value for the given data
    tr_score.append(mod.score(Xp_train, y_train))
    va_score.append(mod.score(Xp_val, y_val))
    
    print('-- Degree', degree[i], '--')
    print('Training r2:  ', tr_score[i])
    print('Validation r2:', va_score[i], '\n')

-- Degree 1 --
Training r2:   0.6382751443295913
Validation r2: 0.5334020784605253 

-- Degree 3 --
Training r2:   0.8228042419821453
Validation r2: 0.8224250413792191 

-- Degree 5 --
Training r2:   0.8508252764216662
Validation r2: 0.8788413864576491 

-- Degree 7 --
Training r2:   0.8728242410836947
Validation r2: 0.12019555086909062 

-- Degree 9 --
Training r2:   0.8899490344773456
Validation r2: -0.10629843366074265 

-- Degree 11 --
Training r2:   0.9108177814282042
Validation r2: -101.55739174294352 
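
The negative validation scores for degrees 9 and 11 are not an error. The `score` method of `LinearRegression` returns the coefficient of determination, which compares the model's squared error with that of a constant predictor that always outputs the mean of the observed targets:

```latex
R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}
```

Here $y_i$ are the observed targets in the scored set, $\hat{y}_i$ the model's predictions, and $\bar{y}$ the mean of the observed targets. When the predictions on the validation set are worse than that constant baseline, the ratio exceeds 1 and $R^2$ drops below zero. This is a typical sign of overfitting at high degree: the training score keeps rising while the validation score collapses.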



## Use the best model to predict some values 
Pick the model with the best validation score and use it to predict the y-values for the top 5 rows of `Xp_val`. Print both the predicted values and the actual values from `y_val`. Don't forget to round the predicted y-values using `np.round()`.
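
One way to approach this step is sketched below. It assumes the same synthetic data as the Part C cell and keeps each fitted transformer and model in a list (the names `fitted`, `best_poly`, and `best_mod` are hypothetical), so the model with the highest validation score can be reused afterwards. On the synthetic target the rounding only illustrates the `np.round()` call; on the project data the labels are the integers 0 to 3.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

# Same synthetic data as the Part C cell.
np.random.seed(1)
n = 40
x = np.random.uniform(-4, 4, n)
X = x.reshape(-1, 1)
y = 0.3 + 0.05 * x + 0.001 * x**7 + np.random.normal(0, 2, n)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.4, random_state=1
)

degree = [1, 3, 5, 7, 9, 11]
fitted = []  # tuples of (validation score, degree, transformer, model)
for d in degree:
    poly = PolynomialFeatures(d)
    Xp_train = poly.fit_transform(X_train)
    mod = LinearRegression().fit(Xp_train, y_train)
    fitted.append((mod.score(poly.transform(X_val), y_val), d, poly, mod))

# Keep the entry with the highest validation score.
best_score, best_degree, best_poly, best_mod = max(fitted, key=lambda t: t[0])

# Predict the first 5 validation rows with the selected model.
Xp_val = best_poly.transform(X_val)
y_pred = np.round(best_mod.predict(Xp_val[:5]))

print('Best degree:', best_degree)
print('Predicted:', y_pred)
print('Actual:   ', y_val[:5])
```

Storing the transformer next to its model matters because `Xp_val` must be built with the same degree the selected model was trained on.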

## More predictions
Finally, predict the y-value for the following row of data:

`F1      F2      F3      F4  F5`

`10.6    -0.9    650.9   Q   D`

Don't forget that you have to prepare the data as you did in Part B.
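
Preparing the new row means reusing the scaler and encoder that were fitted earlier, not fitting new ones on a single row. The sketch below is a minimal illustration under stated assumptions: it fits a scaler and an encoder on only the ten rows shown by `df.head(10)` (as a stand-in for the full training set), and the names `demo`, `scaler`, `encoder`, and `X_new` are hypothetical. In the notebook, the objects fitted in Part B would take their place.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Stand-in training data: the ten rows displayed by df.head(10).
demo = pd.DataFrame({
    'F1': [15.69, -16.81, 21.09, 15.64, 14.25,
           21.54, -14.05, -21.52, 12.31, -24.83],
    'F2': [-0.771, 1.959, -1.55, -1.623, 1.426,
           1.231, 1.608, -1.858, -0.941, -1.94],
    'F3': [550.880459, 588.523801, 660.881834, 374.414543, 446.71412,
           525.126448, 343.26432, 549.753447, 507.148376, 627.0401],
    'F4': ['P', 'Q', 'P', 'Q', 'Q', 'P', 'P', 'Q', 'Q', 'Q'],
    'F5': ['D', 'C', 'B', 'C', 'B', 'D', 'B', 'B', 'D', 'B'],
})

# Fit the preprocessing objects once, on the training data.
scaler = StandardScaler().fit(demo[['F1', 'F2', 'F3']].values)
encoder = OneHotEncoder().fit(demo[['F4', 'F5']].values)

# The new observation, split into numerical and categorical parts.
new_num = np.array([[10.6, -0.9, 650.9]])
new_cat = np.array([['Q', 'D']], dtype=object)

# transform (not fit_transform) so the training statistics are reused.
new_sca = scaler.transform(new_num)
new_enc = encoder.transform(new_cat).toarray()
X_new = np.hstack([new_sca, new_enc])

print(X_new.shape)
```

The resulting `X_new` has the same column layout as the stand-in training matrix, so it can be passed through the same polynomial expansion and then to the selected model's `predict` method, followed by `np.round()`.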