In [97]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.linear_model import LogisticRegression, LinearRegression, SGDRegressor
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

verbose_level = 1
cv_folds = 3
train_score = False
percent_sample = .3

import warnings
# Warning is the base class of every warning category, so this one filter covers them all.
warnings.filterwarnings('ignore', category=Warning)

############################################
#### Get Data ##############################
### Please don't change this part below ####
IN_COLAB = 'google.colab' in str(get_ipython())

if IN_COLAB:
    !wget https://github.com/AkeemSemper/ML_for_Non_DS_Students/raw/main/data/proj_reg_train.csv.cpgz
    !wget https://github.com/AkeemSemper/ML_for_Non_DS_Students/raw/main/data/proj_train.csv
    !wget https://github.com/AkeemSemper/ML_for_Non_DS_Students/raw/main/data/proj_train_labels.csv
    !wget https://github.com/AkeemSemper/ML_for_Non_DS_Students/raw/main/data/proj_test.csv
    df_train_data = pd.read_csv('proj_train.csv')
    df_train_labels = pd.read_csv('proj_train_labels.csv')
    df_test = pd.read_csv('proj_test.csv')
    df_reg_train = pd.read_csv('proj_reg_train.csv.cpgz', compression='gzip', encoding='utf-8', encoding_errors='ignore')
else:
    df_train_data = pd.read_csv('../data/proj_train.csv')
    df_train_labels = pd.read_csv('../data/proj_train_labels.csv')
    df_test = pd.read_csv('../data/proj_test.csv')
    df_reg_train = pd.read_csv('../data/proj_reg_train.csv.cpgz', compression='gzip', encoding='utf-8', encoding_errors='ignore')

header_row = ["feature_" + str(i) for i in range(1, len(df_train_data.columns) + 1)]
df_train_data.columns = header_row
label_header = ["target"]
df_train_labels.columns = label_header
df_test.columns = header_row
df_train = pd.merge(df_train_data, df_train_labels, left_index=True, right_index=True)

df_reg_train.rename(columns={'0707077777770000011006640007650000240000010000001364250462400002300376346204proj_reg_train.csv':"ID"}, inplace=True)
df_reg_train.drop('ID', axis=1, inplace=True)
df_reg_train = df_reg_train.sample(frac=percent_sample)

# Project

Predict the target value for each of the datasets below. One is a classification problem, the other is a regression problem. Please ensure that you read the instructions in each of the text blocks carefully.

The point of this project isn't to squeeze every technique we covered in class into the assignment. Your goal is to make accurate predictions. Beyond the basic rules of 'how things work', you're free, and encouraged, to use any tools or techniques you think will help you achieve that goal; just document why. There isn't one specific answer here. Many models will do the job, and you want to use what you know to search for the best one. Other libraries, tools, and techniques are all fair game if you think they will help. You're not required to seek out anything new, but that is honestly one of the best ways of learning, so if you feel confident, this is a great opportunity to try something new while you still have a chance to ask questions if you get stuck.

In particular, I'd encourage you to think of things systematically from the beginning. You know that you'll need to test several models over and over, set up data pipelines, look at results, and keep track of things as you go. There are likely some pretty simple things you can build up front that will save you a lot of time and effort in the long run. For example, if I were doing this project, I'd want to compare my results between different models, such as regression and tree-based ones (try others too, if you want). There's probably a way to combine some of the common data prep steps for each of those models, and there might also be a way to keep track of their results in a nice, simple way. In some previous workbooks I built shared pipelines for different models that did all of the preprocessing without having to redo it. Lots of these things are simple and easy to do; we just need to think of them. Anything that we might need to track or change by hand can likely be automated.
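
As one possible illustration of the "build it up front" idea, here is a minimal sketch of a shared preprocessing pipeline plus a results table. The data is synthetic (generated with `make_classification`), and the helper names `make_pipeline_for` and `compare_models` are hypothetical, not part of the project files. The imputation and scaling choices are assumptions you should revisit for the real data.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): swap in your own X and y.
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)
X_demo[::15, 0] = np.nan  # inject a few missing values so the imputer has work to do


def make_pipeline_for(model):
    """Wrap any estimator in the same impute-then-scale preprocessing (hypothetical helper)."""
    return Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
        ("model", model),
    ])


def compare_models(models, X, y, cv=3, scoring="accuracy"):
    """Cross-validate each named model and collect the scores in one table (hypothetical helper)."""
    rows = []
    for name, model in models.items():
        scores = cross_val_score(make_pipeline_for(model), X, y, cv=cv, scoring=scoring)
        rows.append({"model": name, "mean_score": scores.mean(), "std_score": scores.std()})
    table = pd.DataFrame(rows).sort_values("mean_score", ascending=False)
    return table.reset_index(drop=True)


candidate_models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
}
results_table = compare_models(candidate_models, X_demo, y_demo)
print(results_table)
```

Adding another candidate then only means adding one entry to `candidate_models`; the preprocessing and the bookkeeping stay the same.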

## Marks Distribution

The marks for this will be split 50/50 between the two models, each being graded as follows:
<ul>
<li> Explanation of process: 50% </li>
    <ul>
    <li> For each model, create a baseline model first (with no tuning), as a comparison point. </li>
    <li> What did you do for data prep/processing, and why? </li>
    <li> What model did you choose and what are the reasons that it was selected? </li>
    <li> What did you do to optimize/tune/improve the model? </li> 
    </ul>
<li> Code Quality: 30% </li>
<li> Model Performance: 20% </li>
</ul>
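
To make the "baseline first, then tune" requirement concrete, here is a rough sketch on synthetic data. The untuned `LogisticRegression` plays the role of the baseline; the `DummyClassifier` is an extra sanity floor that the instructions do not ask for, and the small `C` grid is only an assumed example of tuning, not a recommended search space.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score

# Synthetic stand-in data (assumption): replace with your own features and labels.
X_base, y_base = make_classification(n_samples=300, n_features=10, random_state=0)

# Sanity floor: always predict the most frequent class.
naive_score = cross_val_score(
    DummyClassifier(strategy="most_frequent"), X_base, y_base, cv=3
).mean()

# Baseline: a real model with default settings and no tuning.
untuned_score = cross_val_score(
    LogisticRegression(max_iter=1000), X_base, y_base, cv=3
).mean()

# Tuned: the same model, searched over a small illustrative grid.
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    cv=3,
    scoring="accuracy",
)
search.fit(X_base, y_base)

print(f"naive:   {naive_score:.3f}")
print(f"untuned: {untuned_score:.3f}")
print(f"tuned:   {search.best_score_:.3f} with {search.best_params_}")
```

Reporting all three numbers side by side gives the comparison point that the explanation section asks for.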

## Interpreting the Results

Please complete the following sections with your findings. You can use the bullet lists I started here to enter the answers - Google "colab edit markdown" for a guide on how to format text in markdown if you are not familiar.

### Classification Model

For the classification model, please be sure to include the CSV of your predictions for the test data. 

#### What Were the Results and Your Reasoning?
<ul>
<li>
</ul>

#### What Did You do to Optimize the Model?
<ul>
<li>
</ul>

#### What Did You do to Explore/Process the Data?
<ul>
<li>
</ul>


### Regression Model

#### What Were the Results and Your Reasoning?
<ul>
<li>
</ul>

#### What Did You do to Optimize the Model?
<ul>
<li>
</ul>

#### What Did You do to Explore/Process the Data?
<ul>
<li>
</ul>

## Load Classification Data

In [98]:
#df_train.head()
df_train.head()

| | feature_1 | feature_2 | feature_3 | feature_4 | feature_5 | feature_6 | feature_7 | feature_8 | feature_9 | feature_10 | ... | feature_92 | feature_93 | feature_94 | feature_95 | feature_96 | feature_97 | feature_98 | feature_99 | feature_100 | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.050086 | 0.11158 | 94.08 | 1.765 | 0.089417 | 4.8047 | 0.26742 | NaN | 0.56473 | 0.035123 | ... | 4.1687 | 0.075432 | 0.010869 | 0.063972 | 0.079892 | 1.9795 | 3.5064 | 0.072132 | 0.09195 | 1 |
| 1 | 0.088447 | 2.3634 | 5.058 | 0.14436 | 0.064547 | 2.444 | 4.2545 | 0.36506 | 1.8609 | 0.009759 | ... | 4.5613 | 0.046505 | NaN | 0.084066 | 0.064829 | 3.3087 | 2.9969 | 0.064328 | 0.036793 | 0 |
| 2 | 0.77254 | 0.59469 | NaN | 0.97515 | 0.015987 | 0.52884 | 1.4884 | 3.961 | 4.8063 | 0.048617 | ... | 0.12832 | 0.065028 | 0.036862 | 0.01001 | 0.020709 | 2.5237 | 2.1711 | 0.080865 | 0.081553 | 0 |
| 3 | 0.38241 | 4.8109 | 1955.1 | 0.4605 | 0.024453 | 2.0298 | 3.7403 | 4.2281 | 2.4292 | 0.15683 | ... | 4.3701 | 1.0011 | 0.06575 | 0.043547 | 0.62943 | 4.6262 | 3.1947 | NaN | 0.18718 | 1 |
| 4 | 0.081316 | 4.8415 | 4.0507 | 2.4832 | 0.05899 | 2.3794 | 1.6127 | 2.0422 | 1.6571 | 0.039377 | ... | 2.6804 | 0.076524 | 0.082756 | 0.041953 | 0.018092 | 3.3041 | 0.1922 | 0.0326 | 0.050172 | 0 |


## Start Here

### Exploration and Preprocessing

### Modelling

#### Results

## Test Data and Output

Please load the test data and make predictions for the target variable. You'll need to:
<ul>
<li> Load the test data </li>
<li> Preprocess the test data. Whatever processing was applied to the X values used to train your model must also be applied here, including scaling, removing columns, etc. </li>
<li> Make predictions </li>
<li> Save the predictions to a CSV file </li>
</ul>
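
The steps above could look roughly like the sketch below. It uses small synthetic frames in place of the project data, and the names `demo_model`, `train_X`, `test_X`, and `demo_class_results.csv` are placeholders. Because the imputer and scaler live inside the pipeline, calling `predict` on the test frame applies exactly the preprocessing that was fitted on the training data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for the training and test frames (assumption).
rng = np.random.default_rng(0)
feature_names = [f"feature_{i}" for i in range(1, 6)]
train_X = pd.DataFrame(rng.normal(size=(200, 5)), columns=feature_names)
train_y = (train_X["feature_1"] + train_X["feature_2"] > 0).astype(int)
test_X = pd.DataFrame(rng.normal(size=(50, 5)), columns=feature_names)
test_X.iloc[::10, 2] = np.nan  # the test data can contain missing values too

# All preprocessing is part of the model, so it is learned on train only.
demo_model = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])
demo_model.fit(train_X, train_y)

# Predict on the untouched test frame and save one prediction per row.
test_predictions = demo_model.predict(test_X)
predictions_df = pd.DataFrame(test_predictions, columns=["target"])
predictions_df.to_csv("demo_class_results.csv")

# Read the file back as a quick check that it holds what you expect.
saved_predictions = pd.read_csv("demo_class_results.csv", index_col=0)
print(saved_predictions.head())
```

Any cleaning done outside the pipeline (for example, dropping columns by hand) would still need to be repeated on the test frame before calling `predict`.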

In [105]:
df_test.head()

| | feature_1 | feature_2 | feature_3 | feature_4 | feature_5 | feature_6 | feature_7 | feature_8 | feature_9 | feature_10 | ... | feature_91 | feature_92 | feature_93 | feature_94 | feature_95 | feature_96 | feature_97 | feature_98 | feature_99 | feature_100 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.083796 | 3.9269 | 80.059 | 1.878 | 0.064623 | 4.1468 | 3.7274 | 0.10988 | 1.9025 | 0.093267 | ... | 0.046586 | 2.9683 | 0.07912 | 0.07963 | 0.026689 | 0.066626 | 1.9446 | 2.2163 | 0.099828 | 0.088621 |
| 1 | 0.067591 | 3.0107 | 25.03 | 1.2643 | 3.0311 | 3.0969 | 1.7821 | 1.4346 | 1.8817 | 0.00206 | ... | 0.099853 | 1.2933 | 0.015907 | 0.061634 | 0.07921 | 0.022622 | 2.0196 | 3.443 | 0.069516 | 0.025426 |
| 2 | 0.06483 | 0.1196 | 5.0957 | 1.9366 | 0.028638 | 1.2147 | 0.34675 | 1.1188 | 0.99529 | 0.060067 | ... | 0.095575 | 4.4767 | NaN | 0.06734 | 0.05428 | 0.003331 | 0.95697 | 3.697 | 0.049861 | 0.07314 |
| 3 | 0.34572 | 1.4261 | 44.041 | 2.7656 | 0.095754 | 1.7089 | 3.9449 | 0.1182 | 1.5026 | 0.030124 | ... | 0.21012 | 1.2719 | 0.028321 | 0.086113 | 0.007885 | 0.5282 | 1.8746 | 2.1488 | 0.030729 | 0.020493 |
| 4 | 0.080434 | 0.88394 | 30.033 | 4.139 | 0.021173 | 4.7753 | 2.8376 | 3.5691 | 4.8478 | 0.034984 | ... | 0.052596 | 0.7264 | 0.08529 | 0.084134 | 0.015853 | 0.084836 | 4.4584 | 3.9588 | 0.091619 | 0.007432 |


##### Example of Saving CSV

In [106]:
# Yours should be very close to this. The one exception is that if you're doing data
# cleaning in your training data outside of the pipeline, you should do the same for the test data
# before it is fed in. 

#class_results = class_model.predict(df_test)
#class_res_df = pd.DataFrame(class_results, columns=['target'])
#class_res_df.to_csv('class_results.csv')
#class_results

array([0, 0, 0, ..., 0, 1, 0])

### Data Exploration and Processing

# Regression

Predict the target. You may need to do some processing to make things work well - nothing advanced, just looking at the data and making reasonable decisions. 
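
Since the regression data mixes letter-coded columns with numeric ones, one reasonable kind of processing is to route each column type through its own transformer. The sketch below is only an illustration on a tiny synthetic frame: the column names mimic the real ones, but the values, the target formula, and the choice of `LinearRegression` are all assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny synthetic frame shaped like the regression data (assumption).
rng = np.random.default_rng(0)
n_rows = 200
reg_demo = pd.DataFrame({
    "col_1": rng.choice(["A", "B"], size=n_rows),
    "col_2": rng.choice(["A", "B"], size=n_rows),
    "col_3": rng.uniform(size=n_rows),
    "col_4": rng.uniform(size=n_rows),
})
reg_demo["target"] = (
    1000 * reg_demo["col_3"]
    + 500 * (reg_demo["col_1"] == "B")
    + rng.normal(scale=50, size=n_rows)
)

X_reg = reg_demo.drop(columns="target")
y_reg = reg_demo["target"]

# Split the columns by type so each group gets suitable preprocessing.
numeric_cols = X_reg.select_dtypes(include="number").columns.tolist()
categorical_cols = X_reg.select_dtypes(exclude="number").columns.tolist()

preprocess = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), numeric_cols),
])

reg_pipeline = Pipeline([("prep", preprocess), ("model", LinearRegression())])
r2_scores = cross_val_score(reg_pipeline, X_reg, y_reg, cv=3, scoring="r2")
print(f"mean R^2 across folds: {r2_scores.mean():.3f}")
```

Swapping `LinearRegression` for another regressor only changes the last pipeline step; the column handling stays the same.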

### Load Data

In [107]:
# Data
df_reg_train.head()

| | col_1 | col_2 | col_3 | col_4 | col_5 | col_6 | col_7 | col_8 | col_9 | col_10 | ... | col_122 | col_123 | col_124 | col_125 | col_126 | col_127 | col_128 | col_129 | col_130 | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 143412 | A | B | A | A | A | A | A | A | A | B | ... | 0.337155 | 0.340247 | 0.26847 | 0.41471 | 0.44467 | 0.341813 | 0.335036 | 0.318646 | 0.83659 | 2498.35 |
| 130309 | A | B | A | A | B | A | A | A | B | A | ... | 0.186254 | 0.327081 | 0.27797 | 0.32128 | 0.24355 | 0.180456 | 0.178698 | 0.30435 | 0.220998 | 4331.96 |
| 141370 | A | B | A | A | A | A | A | A | B | B | ... | 0.672862 | 0.551054 | 0.34445 | 0.44767 | 0.53881 | 0.4922 | 0.481306 | 0.654753 | 0.381055 | 2794.88 |
| 128155 | A | B | A | A | A | A | A | A | B | A | ... | 0.340845 | 0.355246 | 0.3128 | 0.39849 | 0.41743 | 0.396226 | 0.387819 | 0.351299 | 0.709024 | 4155.7 |
| 134893 | A | B | A | B | A | A | A | A | B | A | ... | 0.344288 | 0.314714 | 0.95332 | 0.32865 | 0.3148 | 0.272329 | 0.267742 | 0.48667 | 0.813385 | 3841.98 |


### Explore and Process

### Model

### Results