## Instructions {-}

1. This is the template you may use to submit your code and report for the prediction problems on Kaggle.

2. You may modify the template if you deem fit, but it should have the information asked below.

## A.1) Data cleaning

Mention the data cleaning steps taken to prepare your data for developing the model. This may include imputing missing values, dealing with outliers, combining levels of categorical variable(s), etc.

* Put your data cleaning/preparation code with comments here
* The code should begin from reading the train data
* The code should end when you obtain the data used to develop the model in A.4

In [413]:
import os
os.getcwd()
os.chdir(r"C:\Data")  # raw string so the backslash is not read as an escape sequence
os.getcwd()

'C:\\Data'

In [414]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.api as sm
import statsmodels.formula.api as smf
from patsy import dmatrix
from pyearth import Earth
from sklearn.preprocessing import StandardScaler

from sklearn.metrics import (mean_squared_error, r2_score, roc_curve, auc, precision_recall_curve, make_scorer,
                             recall_score, accuracy_score, precision_score, confusion_matrix)
from sklearn.model_selection import (cross_val_score, train_test_split, KFold, StratifiedKFold,
                                     GridSearchCV, ParameterGrid)
from sklearn.tree import DecisionTreeRegressor,DecisionTreeClassifier
from sklearn.ensemble import (BaggingRegressor, BaggingClassifier, RandomForestRegressor, RandomForestClassifier, 
                              GradientBoostingRegressor, GradientBoostingClassifier, AdaBoostRegressor, AdaBoostClassifier, 
                              VotingRegressor, VotingClassifier, StackingRegressor, StackingClassifier)
from sklearn.linear_model import LinearRegression, LogisticRegression, LassoCV, RidgeCV, ElasticNetCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.decomposition import PCA
import itertools as it
import xgboost as xgb
import time
import random
import random
from skimpy import clean_columns

#Libraries for visualizing trees
from sklearn.tree import export_graphviz 
from six import StringIO
from IPython.display import Image  
import pydotplus

In [698]:
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
sample_submission = pd.read_csv('sample_submission.csv')
sample_submission.head()
y_train = train['y']

Dropping the categorical columns and converting infinite values to NaN; the nulls themselves are handled later.

In [700]:
def function_clean(data):
    # Treat +/- infinity as missing so it can be dropped or imputed later
    data = data.replace([np.inf, -np.inf], np.nan)

    # Drop the non-numeric (object-dtype) columns
    categorical_columns = data.select_dtypes(include=['object']).columns
    data = data.drop(categorical_columns, axis=1)

    return data

train_cleaned = function_clean(train)
test_cleaned = function_clean(test)
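
To make the effect of this cleaning step concrete, here is a minimal sketch on a hypothetical toy frame (the column names and values are made up for illustration). The function body is repeated so the example runs on its own.

```python
import numpy as np
import pandas as pd

def function_clean(data):
    # Same logic as the cell above, repeated so this example is self-contained
    data = data.replace([np.inf, -np.inf], np.nan)
    categorical_columns = data.select_dtypes(include=["object"]).columns
    return data.drop(columns=categorical_columns)

# Hypothetical toy data: two numeric columns with infinities, one text column
toy = pd.DataFrame({
    "x001": [1.0, np.inf, 3.0],
    "x002": [-np.inf, 2.0, 5.0],
    "label": pd.Series(["a", "b", "c"], dtype=object),
})

toy_cleaned = function_clean(toy)
print(toy_cleaned)                 # only x001 and x002 remain
print(toy_cleaned.isnull().sum())  # each infinity is now counted as a null
```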

## A.2) Exploratory data analysis

Mention any major insights you obtained from the data, which you used to develop the model. Please put your code or visualizations here if needed.

In [701]:
corr_matrix = train_cleaned.corr().sort_values(by = 'y', ascending = False)
corr_matrix['y']

y       1.000000
x146    0.378696
x102    0.378436
x014    0.364737
x581    0.346549
          ...   
x465         NaN
x518         NaN
x594         NaN
x643         NaN
x703         NaN
Name: y, Length: 756, dtype: float64

While the data still contained nulls and potentially unrepresentative values, I wanted to use KNN to predict the missing values, but realized this would be difficult given how weak the correlations in the correlation matrix were.

I realize I could have used something like PCA for this, or grouped certain variables into a single variable via decision trees.
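
For reference, the KNN imputation idea mentioned above could look roughly like the sketch below. This was not run in the notebook; it uses scikit-learn's `KNNImputer` on a hypothetical toy array, and `n_neighbors=2` is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical toy data: the first row is missing its third feature
X = np.array([
    [1.0, 2.0, np.nan],
    [1.1, 2.1, 3.0],
    [0.9, 1.9, 2.8],
    [8.0, 9.0, 10.0],
])

# Each missing value is filled with the mean of that feature over the
# nearest rows, where distance is measured on the features both rows have
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)
print(X_filled)
```

On real data the quality of the fill depends on how informative the other features are, which is the concern raised above about the weak correlations.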

In [702]:
def drop_nulls(df):
    
    null_counts = df.isnull().sum()
    columns_to_drop = null_counts[null_counts > 50].index
    df_dropped = df.drop(columns=columns_to_drop)

    return df_dropped

test_cleaner = drop_nulls(test_cleaned)
train_cleaner = drop_nulls(train_cleaned)

For the remaining columns, I replaced the nulls with the column mean to finally get rid of all the nulls.

In [703]:
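def replace_avg(data):
    # Fill each column's remaining nulls with that column's mean
    for column in data.columns:
        avg = data[column].mean()
        # Assign back instead of using inplace=True on a column selection,
        # which newer pandas versions warn about and may silently ignore
        data[column] = data[column].fillna(avg)
    return data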

train_avg = replace_avg(train_cleaner)
test_avg = replace_avg(test_cleaner)
test_avg.isnull().sum().sum()

0
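
Note that `replace_avg` fills the test set with the test set's own means. A common alternative, shown here only as a sketch on hypothetical toy frames, is to compute the means on the training data and reuse them for the test data, so both sets go through the same transformation.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frames standing in for train_cleaner / test_cleaner
train_toy = pd.DataFrame({"x001": [1.0, np.nan, 3.0], "x002": [10.0, 20.0, np.nan]})
test_toy = pd.DataFrame({"x001": [np.nan, 5.0], "x002": [np.nan, np.nan]})

# Column means are learned from the training data only
train_means = train_toy.mean()

# fillna with a Series fills each column using the value at the matching label
train_filled = train_toy.fillna(train_means)
test_filled = test_toy.fillna(train_means)

print(test_filled)
```

This also handles a test column that is entirely null, which a per-frame mean cannot fill.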

In [704]:
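# Columns that appear in only one of the two frames
overlap1 = set(test_avg.columns) - set(train_avg.columns)  # test-only columns
overlap2 = set(train_avg.columns) - set(test_avg.columns)  # train-only columns

print(overlap1)
print(overlap2)

# Drop the test-only columns so the test set matches the train set (apart from 'y')
true_test = test_avg.drop(columns = list(overlap1))
true_train = train_avg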

{'x632', 'x182', 'x336', 'x274', 'x075', 'x573', 'x249', 'x178', 'x672', 'x120', 'x043', 'x546', 'x165', 'x588', 'x395', 'x228', 'x207', 'x544', 'x627', 'x760', 'x012', 'x136', 'x084', 'x679', 'x048', 'x500', 'x442', 'x009', 'x143', 'x393', 'x238', 'x329', 'x205', 'x267', 'x292', 'x386', 'x678', 'x351', 'x433', 'x050', 'x278', 'x660', 'x153', 'x719', 'x051', 'x313', 'x360', 'x186', 'x683', 'x762', 'x065', 'x314', 'x052', 'x071', 'x119', 'x705', 'x031', 'x141'}
{'y'}


I made sure to apply my data cleaning/analysis functions to the test data as well, in case I needed to look at it. Overall, there were 618 columns in the train data and 628 in the test data, which suggests that the latter half of the test data might have a different structure than the first half, but with only 10 columns lost, the two seem to be pretty similar.
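
The column alignment above can also be written as an intersection that preserves the training column order. This is a sketch on hypothetical toy frames, not the code used for the submission.

```python
import pandas as pd

# Hypothetical toy frames with partially different columns
train_toy = pd.DataFrame({"id": [0, 1], "x001": [1.0, 2.0], "x002": [3.0, 4.0], "y": [5, 1]})
test_toy = pd.DataFrame({"id": [2, 3], "x001": [1.5, 2.5], "x003": [9.0, 8.0]})

# Keep only the columns both frames share, in the order they appear in train
shared = [c for c in train_toy.columns if c in test_toy.columns]

X_train_toy = train_toy[shared]
X_test_toy = test_toy[shared]
print(shared)
```

Selecting by a shared list guarantees that both frames end up with identical columns in identical order, which matters later when a scaler or PCA is fit on one and applied to the other.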

In [705]:
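# Columns whose correlation with y is NaN (for example, constant columns) are dropped from both sets
null = train_avg.columns[train_avg.corr()['y'].isnull()]
true_test = true_test.drop(columns = null)
true_train = true_train.drop(columns = null)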

true_train

Unnamed: 0,id,x001,x002,x003,x004,x005,x006,x007,x008,x010,...,x754,x755,x756,x757,x758,x759,x761,x763,x764,y
0,0,9.681860e+10,6991.15,7.76,0.00380,5.378811e+09,0.31,266117.20,934577.0,26900000000000,...,92.0,3.37,1.5707,0.0007,297281012,0.13,5,8.5127,14.28,5
1,1,3.304810e+09,13914.43,5.37,0.00015,1.652405e+09,0.00,11927742.92,1798051.0,169000000000000000,...,1026.0,2.40,0.1173,0.1136,3320000000000,0.08,0,1.5700,160.12,1
2,2,3.218944e+10,3991.98,5.77,0.00010,2.476111e+09,0.00,774385.01,375738.0,135000000000000,...,162.0,6.67,0.4582,0.0029,100474819,0.39,2,9.6800,25.06,11
3,3,1.288000e+10,15937.45,5.86,0.00020,2.146667e+09,0.00,6324375.16,1932094.0,37000000000000000,...,817.0,7.40,0.3816,0.0000,348000000000,0.25,1,4.5316,117.76,1
4,4,3.063412e+10,3621.00,7.52,0.00060,1.392460e+09,0.21,169860.29,474253.0,6000000000000,...,62.0,1.14,0.0100,0.0005,109546590,0.11,1,16.2717,5.81,5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5375,5375,3.948791e+09,24563.46,6.73,0.00035,9.871977e+08,0.43,3303184.55,3154159.0,11900000000000000,...,685.0,15.10,1.3758,0.0000,158603315,0.05,0,2.7480,93.45,4
5376,5376,9.279017e+10,21572.94,6.96,0.00120,3.093006e+09,0.30,2649164.57,2934417.0,7220000000000000,...,558.0,4.38,0.2230,0.0003,36089167,0.01,4,23.6890,76.05,8
5377,5377,2.700359e+10,23061.73,6.36,0.00065,3.857656e+09,0.35,1825306.07,2395841.0,3960000000000000,...,650.0,1.87,0.1300,0.0057,1786891,0.53,0,4.3710,80.30,21
5378,5378,4.351107e+10,5739.04,7.80,0.00065,1.318517e+09,0.29,144103.12,715173.0,4150000000000,...,56.0,7.51,0.2719,0.0001,194000000000,0.29,2,24.6594,7.95,13


In [706]:
true_test

Unnamed: 0,id,x001,x002,x003,x004,x005,x006,x007,x008,x010,...,x753,x754,x755,x756,x757,x758,x759,x761,x763,x764
0,5380,6.507826e+10,7882.15,6.82,0.00210,1.712586e+09,0.39,583617.74,862986.0,147000000000000,...,131.17,202.0,4.01,0.0380,0.0010,90204869909,0.26,5,30.1213,27.95
1,5381,3.122741e+09,4682.13,8.17,0.00010,1.040914e+09,0.00,190000.65,688710.0,11300000000000,...,110.80,62.0,2.94,0.1866,0.0192,37449565014,0.02,1,2.1282,10.18
2,5382,3.888719e+10,7495.57,7.15,0.00285,2.160400e+09,0.42,351570.67,841523.0,41400000000000,...,127.10,140.0,1.45,0.0100,0.0017,10847937619,0.83,1,7.8513,21.27
3,5383,7.727427e+10,4003.76,6.53,0.00165,5.519591e+09,0.00,320216.05,466131.0,20800000000000,...,158.21,119.0,2.90,0.4636,0.0000,37200096,0.51,4,9.0511,18.38
4,5384,4.184868e+09,34874.72,6.39,0.00065,1.046217e+09,0.50,3349978.53,3711028.0,21100000000000000,...,129.62,1061.0,10.90,2.8737,0.0001,16400000000000,0.12,1,2.6512,149.68
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4398,9778,3.217682e+09,2214.42,5.27,0.00010,1.608841e+09,0.00,520766.78,172141.0,38700000000000,...,129.00,135.0,2.30,0.0771,0.0095,31446931515,0.74,0,1.6431,22.42
4399,9779,3.042820e+10,14279.29,6.18,0.00135,1.901762e+09,0.33,1634334.25,1420919.0,2330000000000000,...,122.70,474.0,4.13,0.2828,0.0000,18200000000000,0.42,2,12.3538,68.09
4400,9780,8.556628e+09,7518.26,6.74,0.00005,2.852209e+09,0.00,559939.70,911940.0,122000000000000,...,130.75,209.0,3.92,0.2890,0.0010,551000000000,0.67,1,2.5226,29.27
4401,9781,7.384902e+10,2556.73,6.47,0.00140,1.605414e+09,0.25,186175.29,236336.0,2960000000000,...,121.10,71.0,6.44,0.5227,0.0127,20500000000000,0.67,6,32.7632,10.22


In [707]:
corr_matrix = true_train.corr()

vif = pd.DataFrame(index=true_train.columns, columns=['VIF'])

# Note: the score computed here equals 1 / (sum of squared pairwise correlations
# with the other columns). It is a rough heuristic, not the standard VIF, which
# requires regressing each column on all of the others.
for var in true_train.columns:
    mask = true_train.columns != var
    r_squared = 1.0 - corr_matrix.loc[var, mask].pow(2).sum()
    vif.loc[var, 'VIF'] = 1.0 / (1.0 - r_squared)
    
high_vif_variables = vif[vif['VIF'] > 3]
print(high_vif_variables)

            VIF
id    10.903311
x402   8.413305


Because so few variables showed a high VIF, and because the data I was working with was presumably high-dimensional, I pivoted toward PCA in order to drastically reduce the dimensionality and the number of variables.

I chose PCA because it is unsupervised: it does not consider y, unlike some other selection processes, which consider proximity and similar criteria. I was also having a lot of trouble selecting variables at this point, so I decided to start over, as I wasn't sure whether I was cleaning away significant variables.

In [708]:
# Drop the two columns flagged in the previous cell
test_true = true_test.drop(columns = ['id', 'x402'])
train_true = true_train.drop(columns = ['id','x402'])

# Log-transform every remaining column (including y in the train data);
# zeros become -inf and negative values become NaN
logtrain = np.log(train_true)
logtest = np.log(test_true)

# Reuse function_clean to turn the new infinities into NaN
logtrain = function_clean(logtrain)
logtest = function_clean(logtest)

# Count the nulls per column after the transform
logtrain.isnull().sum().sort_values(ascending = False)

x102    5380
x570    5380
x687    5380
x651    5380
x631    5380
        ... 
x394       0
x391       0
x390       0
x002       0
y          0
Length: 616, dtype: int64

I decided to log-transform the data because some of the values seemed extreme. A side effect was that it produced infinite and undefined values, which I then had to clean.
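
The sketch below shows, on a hypothetical toy frame, where those values come from: `np.log` maps 0 to `-inf` and negative numbers to `NaN`. The `np.log1p` line is an alternative that was not used in this notebook; it is included only to show one way of keeping zeros finite.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data containing a zero and a negative value
toy = pd.DataFrame({"x001": [10.0, 0.0, 1000.0], "x002": [1.0, -2.0, 3.0]})

# Silence the divide-by-zero / invalid-value warnings for this demonstration
with np.errstate(divide="ignore", invalid="ignore"):
    logged = np.log(toy)

# log(0) is -inf and log(negative) is NaN; convert the infinities to NaN
cleaned = logged.replace([np.inf, -np.inf], np.nan)
print(cleaned)

# Alternative (assumption, not used above): log(1 + x) keeps zeros at 0
log1p_x001 = np.log1p(toy["x001"])
print(log1p_x001)
```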

I was also having some problems with `dropna`, so I wrote my own function instead.

In [709]:
def drop_nulls(df):
    
    null_counts = df.isnull().sum()
    columns_to_drop = null_counts[null_counts > 20].index
    df_dropped = df.drop(columns=columns_to_drop)

    return df_dropped

train_dropped = drop_nulls(logtrain)
test_dropped = drop_nulls(logtest)

train = replace_avg(train_dropped)
test = replace_avg(test_dropped)

In [731]:
overlap1 = set(test.columns) - set(train.columns)
overlap2 = set(train.columns) - set(test.columns)

_test = test.drop(columns = overlap1)
_train = train.drop(columns = overlap2)
_ytrain = train['y']

I now have my training and test data, filtered down to only 357 columns, but with much more reduction still to go.

## A.3) Feature selection/reduction

Mention the steps for feature selection/reduction. Please put your code or visualizations here if needed.

- I felt most comfortable with neural networks, so I used PCA plus a neural ensemble model to try to handle feature selection.

In [770]:
import tensorflow as tf
tf.config.run_functions_eagerly(True)
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import cross_val_score

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras import backend as K
from tensorflow.keras import regularizers
from tensorflow.keras.layers import LSTM, MaxPooling1D, Conv1D, Conv2D
from tensorflow.keras.layers import Bidirectional, MaxPooling2D 
from tensorflow.keras.layers import Flatten, Dense, Dropout, GRU

In [771]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(_train)
X_test_scaled = scaler.transform(_test)

pca = PCA(n_components=.99)
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)

print("Explained Variance Ratio:", pca.explained_variance_ratio_)

Explained Variance Ratio: [0.27962705 0.16901551 0.11529849 0.0724644  0.04958262 0.03393412
 0.02692434 0.01961699 0.01933426 0.01330721 0.0127878  0.01053347
 0.00877735 0.00833204 0.00771909 0.00660525 0.00635235 0.00591934
 0.00539327 0.00498733 0.00489137 0.00459015 0.00435828 0.00419988
 0.0040785  0.0040049  0.00357149 0.00339706 0.003222   0.00308328
 0.00296318 0.00293465 0.00284362 0.00277817 0.0026599  0.00262823
 0.00245465 0.0023865  0.0023454  0.00227897 0.00216725 0.00214226
 0.00213617 0.00208896 0.00197638 0.00193201 0.00181369 0.00179038
 0.00169091 0.00161075 0.00155345 0.00149478 0.00144642 0.00142061
 0.00139727 0.00134688 0.00130079 0.00129026 0.0012299  0.00118318
 0.00114639 0.00110025 0.00106091 0.00096858 0.00093972 0.00090459
 0.00088482 0.00083055 0.00080745 0.00077907 0.00070803 0.00067277
 0.00066672 0.00064352 0.00061853 0.00060275 0.00057887 0.00056499
 0.00051498]


In [772]:
X_train_pca.shape

(5380, 79)

In [773]:
X_test_pca.shape

(4403, 79)

I ended up with 79 principal components, which together explain 99% of the variance.
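
To illustrate how a fractional `n_components` translates into a component count, the following sketch fits PCA on hypothetical synthetic data and checks the cumulative explained variance. The data shape and noise level are arbitrary assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical data: 40 observed columns driven by 5 latent factors plus noise
rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 5))
mixing = rng.normal(size=(5, 40))
X = latent @ mixing + 0.05 * rng.normal(size=(300, 40))

X_scaled = StandardScaler().fit_transform(X)

# A float in (0, 1) asks PCA to keep just enough components to pass
# that share of the total variance
pca = PCA(n_components=0.99).fit(X_scaled)
cumulative = np.cumsum(pca.explained_variance_ratio_)

print("components kept:", pca.n_components_)
print("cumulative variance explained:", cumulative[-1])
```

Plotting `cumulative` against the component index is a quick way to see whether a lower threshold, such as 95%, would cut the component count substantially.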

## A.4) Developing the model

Mention the logical sequence of steps taken to obtain the final model. 

- Neural Network Attempt no. 1
- Model 2

My first concern here was the feature selection. I was using neural networks to

One of the first suggestions I received for the model was to select the top couple of variables with respect to the correlation, but 

### Single Neural Network Model using PCA data

I suspected that the linear activation was performing suspiciously well, but I kept trying neural network models to see whether they could perform acceptably.

In [795]:
# test_size is chosen so the held-out split has 4403 rows, the same length as the Kaggle test set
X_train, X_test, y_train, y_test = train_test_split(X_train_pca, _ytrain, 
                                                    test_size=0.81829, random_state=36)
print(y_test.shape)
print(X_test.shape)
print(y_train.shape)
print(X_train.shape)

(4403,)
(4403, 79)
(977,)
(977, 79)


In [796]:
model = Sequential()
model.add(Dense(64, activation='relu', input_shape=(X_test_pca.shape[1],)))
model.add(Dense(128, activation='relu'))
model.add(Dense(64, activation='relu'))
model.add(Dense(1))

# Compile the model
model.compile(optimizer='adam', loss='MeanSquaredError', metrics=['RootMeanSquaredError'])

In [797]:
history = model.fit(X_train_pca, _ytrain, 
                    epochs=10, batch_size=32)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [798]:
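# Predict on the PCA-transformed Kaggle test set (not on the held-out split X_test)
y_pred = model.predict(X_test_pca)

# Caution: y_test is a random split of the training labels, so its rows do not
# correspond to the rows of X_test_pca; this number is not a true validation RMSE
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print("Root Mean Squared Error on Validation Set:", rmse)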

Root Mean Squared Error on Validation Set: 1.2761605820333004
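
The RMSE above compares predictions for the Kaggle test rows against labels from a random split of the training data, and the model was fit on all of the training rows. A conventional holdout check fits on one part of the training data and scores on the remaining part. The sketch below shows that pattern on hypothetical synthetic data; `Ridge` is used as a lightweight stand-in for the Keras network, and the 80/20 split is an arbitrary choice.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins for X_train_pca and _ytrain
rng = np.random.default_rng(36)
X = rng.normal(size=(500, 10))
y = 0.5 * X[:, 0] + rng.normal(scale=0.1, size=500)

# Hold out 20% of the labelled rows for validation
X_fit, X_val, y_fit, y_val = train_test_split(X, y, test_size=0.2, random_state=36)

# Fit only on the fitting split; any regressor with fit/predict works here
model = Ridge(alpha=1.0).fit(X_fit, y_fit)

# Score on rows the model has not seen, using their own labels
val_pred = model.predict(X_val)
val_rmse = np.sqrt(mean_squared_error(y_val, val_pred))
print("Holdout RMSE:", val_rmse)
```

With Keras, the same idea applies by passing the fitting split to `model.fit` and the held-out split to `model.predict`.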


In [842]:
len(y_pred)
to_merge = pd.DataFrame(y_pred)
merge = pd.concat([true_test['id'], to_merge], axis = 1)
merge

Unnamed: 0,id,0
0,5380,2.074097
1,5381,2.059591
2,5382,1.228719
3,5383,0.953873
4,5384,0.652356
...,...,...
4398,9778,1.354312
4399,9779,1.727634
4400,9780,2.806687
4401,9781,1.601340


In [843]:
merge.columns =['id', 'y']

In [839]:
# Undo the log transform applied to y earlier so predictions are on the original scale
merge['y'] = np.exp(merge['y'])

In [841]:
submission = merge
submission

Unnamed: 0,id,y
0,5380,7.957357
1,5381,7.842758
2,5382,3.416849
3,5383,2.595744
4,5384,1.920058
...,...,...
4398,9778,3.874095
4399,9779,5.627323
4400,9780,16.554986
4401,9781,4.959676


In [844]:
merge.to_csv('y_pred.csv', index=False)

## A.5) Discussion

Please provide details of the models/approaches you attempted but encountered challenges or unfavorable outcomes. If feasible, kindly explain the reasons behind their ineffectiveness or lack of success. Additionally, highlight the significant challenges or issues you encountered during the process.

I had a hard time with some feature selection models like KNN, random forest, and gradient boosting because they took a long time to run on my data without giving great results. Once I thought about it theoretically, it made sense why approaches like KNN didn't work, because the RMSE wasn't necessarily driven by the number of neighbors I had assigned. I could have used a technique like grid search to find the optimal hyperparameters, but having seen great results with neural networks in my STAT 362 class, I felt I could use them much more comfortably than the other models.
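
The grid search mentioned above was not run for this project. As a sketch of what it could look like for KNN, the example below uses scikit-learn's `GridSearchCV` on hypothetical synthetic data; the candidate values for `n_neighbors` and the 5-fold setting are arbitrary assumptions.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical stand-in data
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X[:, 0] + 0.1 * rng.normal(size=200)

# Try each candidate n_neighbors with 5-fold cross-validation,
# scoring by (negated) RMSE so that higher is better
search = GridSearchCV(
    KNeighborsRegressor(),
    param_grid={"n_neighbors": [1, 3, 5, 10, 20]},
    scoring="neg_root_mean_squared_error",
    cv=5,
)
search.fit(X, y)

print("best n_neighbors:", search.best_params_["n_neighbors"])
print("cross-validated RMSE:", -search.best_score_)
```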

## A.6) Conclusion

* Do you feel that you gain valuable experience, skills, and/or knowledge? If yes, please explain what they were. If no, please explain.
* What are things you liked/disliked about the project and/or work on the project?

I feel like I learned so many important new things about machine learning that I wouldn't have been able to learn in any other way. I think working with this dataset, which has no strict linear or dimensional pattern, has been very helpful to my understanding of data science and statistics in general. For example, data cleaning decisions were very hard to make, including a lot of the traditional data cleaning decisions I think we make on various kinds of datasets.

I also feel like I was able to combine my knowledge from various classes, such as 303-3 and 362, in a way that felt very real-world to me and that made all the math and other statistics I learned feel relevant.

## Please make sure your GitHub repo has all the code and ensure that your code is capable of reproducing the outcomes you have submitted. It is important to avoid any form of academic misconduct or cheating by using your peer's submission file.