## **MileStone_2**

Make a baseline learning notebook that carries out some sort of linear or logistic regression (to be used as a benchmark; feel free to use sklearn). The details are left to you, but explain what you are doing in text cells in the notebook.


## **Import Packages & Functions**
The "getfile" function loads a CSV file and splits it into debugging_dataset (rows dated before 1992-01-01) and working_dataset (rows dated from 1992-01-01 onward).

In [1]:
import pandas
import numpy
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split
import warnings

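def getfile(location_pair,**kwargs):
    (_,gdrive)=location_pair #the first element (file name) is only a label; the data is read from the Google Drive link
    loc = 'https://drive.google.com/uc?export=download&id='+gdrive.split('/')[-2] #build a direct-download URL from the file ID in the sharing link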

    #Convert these datasets to pandas
    raw_dataset=pandas.read_csv(loc,**kwargs)
    #Convert datetime to pandas timestamps
    raw_dataset['DATE'] = pandas.to_datetime(raw_dataset['DATE'], format='%Y / %m')

    #Separate the data at the split date
    split_date = '1992-01-01'
    debugging_dataset = raw_dataset[raw_dataset['DATE'] < split_date] #1990-1991 (2 years)
    debugging_dataset = debugging_dataset.sort_values(by='DATE', ascending=True)
    debugging_dataset = debugging_dataset.reset_index(drop=True) # reset the data index

    working_dataset = raw_dataset[raw_dataset['DATE'] >= split_date] #1992-2023 (~32 years)
    working_dataset = working_dataset.sort_values(by='DATE', ascending=True)
    working_dataset = working_dataset.reset_index(drop=True) # reset the data index

    return raw_dataset, debugging_dataset, working_dataset
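
As a minimal sketch of the two steps `getfile` performs, the snippet below extracts the file ID from a Drive sharing link and splits a small frame at the split date. The toy rows and the names `toy`, `before`, and `after` are made up for illustration; only the link, the date format, and the split date come from the cell above.

```python
import pandas

# The file ID is the second-to-last path segment of the sharing link.
share_link = "https://drive.google.com/file/d/1DkB88nrVF2B60Rjxi_7fzRnD31-LUqXH/view?usp=sharing"
file_id = share_link.split("/")[-2]
download_url = "https://drive.google.com/uc?export=download&id=" + file_id

# Toy monthly rows (hypothetical values) in the same 'YYYY / MM' text format.
toy = pandas.DataFrame({
    "DATE": ["1991 / 11", "1991 / 12", "1992 / 01", "1992 / 02"],
    "VALUE": [1.0, 2.0, 3.0, 4.0],
})
toy["DATE"] = pandas.to_datetime(toy["DATE"], format="%Y / %m")

# Rows strictly before the split date go to the debugging-style subset.
split_date = pandas.Timestamp("1992-01-01")
before = toy[toy["DATE"] < split_date].reset_index(drop=True)
after = toy[toy["DATE"] >= split_date].reset_index(drop=True)

print(file_id)
print(len(before), "rows before,", len(after), "rows from the split date onward")
```

Because the comparison is `<` on one side and `>=` on the other, a row dated exactly 1992-01-01 always lands in the working-style subset and no row is dropped or duplicated.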

## **Load the dataset from CSV files**
We download two datasets from open sources: New York population and weather data, and New York tonnage data.
`getfile` splits each dataset into debugging and working subsets, and the matching subsets are then merged on `DATE` and `BOROUGH`.


In [2]:
# **Import the CSV files**
fname_1=("NY_Population_Weather_Data.csv","https://drive.google.com/file/d/1DkB88nrVF2B60Rjxi_7fzRnD31-LUqXH/view?usp=sharing") #Load the New York population&weather dataset.
fname_2=("NY_Tonnage_Data_v2.csv","https://drive.google.com/file/d/1-57Sr-WC3g5MRxSXCun6HpD-SB1C-B6Q/view?usp=sharing") #Load the New York tonnage dataset.

raw_dataset_1, debugging_dataset_1, working_dataset_1=getfile(fname_1)
raw_dataset_2, debugging_dataset_2, working_dataset_2=getfile(fname_2)
print("NY_Population_Weather_Data_dataset dimension:", raw_dataset_1.shape)
print("NY_Tonnage_Data_dataset dimension:", raw_dataset_2.shape)

### **Merge the working_dataset:**
working_dataset = pandas.merge(working_dataset_1, working_dataset_2, on=["DATE", "BOROUGH"], how='right')
working_dataset.to_pickle('./working_dataset.pkl')#Pickle the data
#print(debugging_dataset.head())

### **Merge the debug_dataset:**
debugging_dataset = pandas.merge(debugging_dataset_1, debugging_dataset_2, on=["DATE", "BOROUGH"], how='right')
debugging_dataset.to_pickle('./debugging_dataset.pkl')#Pickle the data
debugging_dataset.head()

NY_Population_Weather_Data_dataset dimension: (2023, 11)
NY_Tonnage_Data_dataset dimension: (1983, 9)


Unnamed: 0,DATE,BOROUGH,POPULATION,POPULATION PERCENTAGE,AWND,PRCP,SNOW,TAVG,TMAX,TMIN,TSUN,REFUSETONSCOLLECTED,PAPERTONSCOLLECTED,MGPTONSCOLLECTED,RESORGANICSTONS,SCHOOLORGANICTONS,LEAVESORGANICTONS,XMASTREETONS
0,1990-01-01,Manhattan,1487536,20.31,,135.8,45.0,5.2,8.64,1.75,9018.0,24.4,0.0,0.0,0.0,0.0,0.0,0.0
1,1990-01-01,Queens,1951598,26.65,,135.8,45.0,5.2,8.64,1.75,9018.0,8.4,0.0,0.0,0.0,0.0,0.0,0.0
2,1990-01-01,Staten Island,378977,5.18,,135.8,45.0,5.2,8.64,1.75,9018.0,39.1,0.0,0.0,0.0,0.0,0.0,0.0
3,1990-06-01,Staten Island,378977,5.18,,63.6,0.0,22.28,27.47,17.1,16853.0,11518.0,0.0,0.0,0.0,0.0,0.0,0.0
4,1990-07-01,Staten Island,378977,5.18,,89.2,0.0,24.86,29.28,20.45,12776.0,15243.8,0.0,0.0,0.0,0.0,0.0,0.0
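
The merge above uses `how='right'`, so every tonnage row is kept even when no weather row matches it. The sketch below, on made-up toy frames, shows one way to count such unmatched rows with the `indicator` option of `pandas.merge`; the frame names and values are illustrative assumptions, not the real data.

```python
import pandas

# Hypothetical stand-ins for the population/weather and tonnage tables.
weather = pandas.DataFrame({
    "DATE": pandas.to_datetime(["1990-01-01", "1990-01-01", "1990-02-01"]),
    "BOROUGH": ["Manhattan", "Queens", "Manhattan"],
    "TAVG": [5.2, 5.2, 6.0],
})
tonnage = pandas.DataFrame({
    "DATE": pandas.to_datetime(["1990-01-01", "1990-01-01", "1990-03-01"]),
    "BOROUGH": ["Manhattan", "Queens", "Bronx"],
    "REFUSETONSCOLLECTED": [24.4, 8.4, 10.0],
})

# A right merge keeps all tonnage rows; indicator=True adds a '_merge' column
# that records whether each row was found in both tables or only the right one.
merged = pandas.merge(weather, tonnage, on=["DATE", "BOROUGH"],
                      how="right", indicator=True)
unmatched = merged[merged["_merge"] == "right_only"]

print(merged)
print("tonnage rows without weather data:", len(unmatched))
```

Unmatched rows carry NaN in the weather columns, which is one possible source of the missing values that the imputer later replaces with 0.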


## **Data pre-process on the debugging dataset(one-hot, imputer, and scaler)**
Because the borough is stored as a string rather than a numerical value, we one-hot encode it into one column per borough.
The DATE column is split into separate "YEAR" and "MONTH" columns.

In [3]:
#Apply One-Hot to separate the BOROUGH types & DATE in columns
debugging_dataset = pandas.get_dummies(debugging_dataset, columns=['BOROUGH'], prefix=['BOROUGH'])
debugging_dataset['DATE'] = pandas.to_datetime(debugging_dataset['DATE']) # separate the date into two columns, "year" & "month"
debugging_dataset['YEAR'] = debugging_dataset['DATE'].dt.year #add new column "year"
debugging_dataset['MONTH'] = debugging_dataset['DATE'].dt.month #add new column "month"
debugging_dataset = debugging_dataset.drop(columns=['DATE']) #drop the "DATE" columns
debugging_dataset.head()

Unnamed: 0,POPULATION,POPULATION PERCENTAGE,AWND,PRCP,SNOW,TAVG,TMAX,TMIN,TSUN,REFUSETONSCOLLECTED,...,SCHOOLORGANICTONS,LEAVESORGANICTONS,XMASTREETONS,BOROUGH_Bronx,BOROUGH_Brooklyn,BOROUGH_Manhattan,BOROUGH_Queens,BOROUGH_Staten Island,YEAR,MONTH
0,1487536,20.31,,135.8,45.0,5.2,8.64,1.75,9018.0,24.4,...,0.0,0.0,0.0,0,0,1,0,0,1990,1
1,1951598,26.65,,135.8,45.0,5.2,8.64,1.75,9018.0,8.4,...,0.0,0.0,0.0,0,0,0,1,0,1990,1
2,378977,5.18,,135.8,45.0,5.2,8.64,1.75,9018.0,39.1,...,0.0,0.0,0.0,0,0,0,0,1,1990,1
3,378977,5.18,,63.6,0.0,22.28,27.47,17.1,16853.0,11518.0,...,0.0,0.0,0.0,0,0,0,0,1,1990,6
4,378977,5.18,,89.2,0.0,24.86,29.28,20.45,12776.0,15243.8,...,0.0,0.0,0.0,0,0,0,0,1,1990,7
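
The same two transformations can be tried on a tiny frame to see exactly which columns they produce. This is only a sketch on made-up rows; `dtype=int` is an assumption added here so the dummy columns show 0/1 as in the output above, since recent pandas versions return booleans by default.

```python
import pandas

# Hypothetical rows with the two columns being transformed.
toy = pandas.DataFrame({
    "DATE": pandas.to_datetime(["1990-01-01", "1990-06-01", "1991-07-01"]),
    "BOROUGH": ["Manhattan", "Queens", "Staten Island"],
})

# One indicator column per borough value.
encoded = pandas.get_dummies(toy, columns=["BOROUGH"], prefix="BOROUGH", dtype=int)

# Split the timestamp into numeric year and month, then drop the original column.
encoded["YEAR"] = encoded["DATE"].dt.year
encoded["MONTH"] = encoded["DATE"].dt.month
encoded = encoded.drop(columns=["DATE"])

print(encoded)
```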


### **Generate the input features and targeting label for training**
X holds the input features (16 features).

Y holds the target labels (tonnage collected for each of 7 types of trash).

In [4]:
### **Replace the 'nan' values & normalize the values**
imputer = SimpleImputer(missing_values=numpy.nan, strategy='constant', fill_value=0) #replace the 'nan' value to be 0
scaler = StandardScaler() #standard normalization

### **Generate the input features (X)**
X = debugging_dataset.drop(columns=['REFUSETONSCOLLECTED','PAPERTONSCOLLECTED','MGPTONSCOLLECTED','RESORGANICSTONS','SCHOOLORGANICTONS','LEAVESORGANICTONS',	'XMASTREETONS']) #only remain the input features we need
X[['AWND','PRCP','SNOW','TAVG','TMAX','TMIN','TSUN']] = imputer.fit_transform(X[['AWND','PRCP','SNOW','TAVG','TMAX','TMIN','TSUN']])  #replace the 'nan' value to be 0
X[['POPULATION','POPULATION PERCENTAGE','AWND','PRCP','SNOW','TAVG','TMAX','TMIN','TSUN','MONTH','YEAR']] = scaler.fit_transform(X[['POPULATION','POPULATION PERCENTAGE','AWND','PRCP','SNOW','TAVG','TMAX','TMIN','TSUN','MONTH','YEAR']]) #standard scale

print('Features shape:',X.shape)
print('-----------------------------------------------------------------------------------')
X

Features shape: (78, 16)
-----------------------------------------------------------------------------------


Unnamed: 0,POPULATION,POPULATION PERCENTAGE,AWND,PRCP,SNOW,TAVG,TMAX,TMIN,TSUN,BOROUGH_Bronx,BOROUGH_Brooklyn,BOROUGH_Manhattan,BOROUGH_Queens,BOROUGH_Staten Island,YEAR,MONTH
0,0.158353,0.157907,0.0,0.617946,0.078868,-1.082467,-1.136736,-1.018932,-1.040473,0,0,1,0,0,-1.825742,-1.651990
1,0.841237,0.841214,0.0,0.617946,0.078868,-1.082467,-1.136736,-1.018932,-1.040473,0,0,0,1,0,-1.825742,-1.651990
2,-1.472932,-1.472759,0.0,0.617946,0.078868,-1.082467,-1.136736,-1.018932,-1.040473,0,0,0,0,1,-1.825742,-1.651990
3,-1.472932,-1.472759,0.0,-0.817412,-0.490732,1.031004,1.036758,1.023538,1.097957,0,0,0,0,1,-1.825742,-0.254428
4,-1.472932,-1.472759,0.0,-0.308476,-0.490732,1.350252,1.245682,1.469289,-0.014791,0,0,0,0,1,-1.825742,0.025084
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
73,0.841237,0.841214,0.0,0.069250,-0.262892,-1.206207,-1.217535,-1.189249,-1.116621,0,0,0,1,0,0.547723,1.422647
74,-0.259191,-0.259190,0.0,0.069250,-0.262892,-1.206207,-1.217535,-1.189249,-1.116621,1,0,0,0,0,0.547723,1.422647
75,1.354900,1.355310,0.0,0.069250,-0.262892,-1.206207,-1.217535,-1.189249,-1.116621,0,1,0,0,0,0.547723,1.422647
76,0.158353,0.157907,0.0,0.069250,-0.262892,-1.206207,-1.217535,-1.189249,-1.116621,0,0,1,0,0,0.547723,1.422647
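
In the cell above the imputer and scaler are fitted on all rows before the train/test split, so the test rows influence the scaling statistics. A possible alternative, sketched here on made-up data, is to wrap both steps in a scikit-learn `ColumnTransformer` and fit it on the training split only. The toy columns and the names `preprocess`, `X_train_t`, and `X_test_t` are assumptions for illustration.

```python
import numpy
import pandas
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical features: two numeric weather columns and one one-hot column.
rng = numpy.random.default_rng(0)
toy = pandas.DataFrame({
    "TAVG": rng.normal(12.0, 8.0, size=20),
    "PRCP": rng.normal(90.0, 30.0, size=20),
    "BOROUGH_Queens": rng.integers(0, 2, size=20),
})
toy.loc[[3, 7], "TAVG"] = numpy.nan  # simulate missing measurements

# Impute then scale the numeric columns; pass the one-hot column through unchanged.
numeric_cols = ["TAVG", "PRCP"]
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="constant", fill_value=0)),
    ("scale", StandardScaler()),
])
preprocess = ColumnTransformer([("num", numeric_steps, numeric_cols)],
                               remainder="passthrough")

# Fit on the training rows only, then reuse the fitted statistics on the test rows.
X_train, X_test = train_test_split(toy, test_size=0.2, random_state=42)
X_train_t = preprocess.fit_transform(X_train)
X_test_t = preprocess.transform(X_test)

print(X_train_t.shape, X_test_t.shape)
```

With only 78 rows the effect on the reported loss may be small, but fitting on the training split keeps the test loss an honest estimate.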


In [5]:
### **Generate the targeting data (Y)**
target_cols = ['REFUSETONSCOLLECTED','PAPERTONSCOLLECTED','MGPTONSCOLLECTED','RESORGANICSTONS','SCHOOLORGANICTONS','LEAVESORGANICTONS','XMASTREETONS']
Y = debugging_dataset[target_cols].copy() #Amount of tonnage (per type of trash); copy so the assignment below does not raise SettingWithCopyWarning
#standard scale (note: this refits the same scaler object that was fitted on X above)
Y[target_cols] = scaler.fit_transform(Y[target_cols])

print('Target shape:',Y.shape)
print('-----------------------------------------------------------------------------------')
Y

Target shape: (78, 7)
-----------------------------------------------------------------------------------


Unnamed: 0,REFUSETONSCOLLECTED,PAPERTONSCOLLECTED,MGPTONSCOLLECTED,RESORGANICSTONS,SCHOOLORGANICTONS,LEAVESORGANICTONS,XMASTREETONS
0,-1.482389,0.0,0.0,0.0,0.0,0.0,0.0
1,-1.483088,0.0,0.0,0.0,0.0,0.0,0.0
2,-1.481747,0.0,0.0,0.0,0.0,0.0,0.0
3,-0.980419,0.0,0.0,0.0,0.0,0.0,0.0
4,-0.817699,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...
73,1.278672,0.0,0.0,0.0,0.0,0.0,0.0
74,0.291568,0.0,0.0,0.0,0.0,0.0,0.0
75,1.785131,0.0,0.0,0.0,0.0,0.0,0.0
76,0.527468,0.0,0.0,0.0,0.0,0.0,0.0
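
Six of the seven scaled target columns show 0.0 in every displayed row. One likely explanation, assuming those columns are constant in the 1990-1991 debugging period, is that `StandardScaler` maps a zero-variance column to zeros. The sketch below checks this behavior on a small array and also shows how a dedicated target scaler (the hypothetical `y_scaler`) can convert scaled values back to tons.

```python
import numpy
from sklearn.preprocessing import StandardScaler

# Two target columns: one that varies and one that is constant at zero.
tons = numpy.array([
    [24.4, 0.0],
    [11518.0, 0.0],
    [15243.8, 0.0],
])

# A separate scaler for the targets keeps its mean and scale available later.
y_scaler = StandardScaler()
scaled = y_scaler.fit_transform(tons)

# inverse_transform maps scaled values (or model predictions) back to tons.
restored = y_scaler.inverse_transform(scaled)

print("scaled constant column:", scaled[:, 1])
print("restored first column:", restored[:, 0])
```

A model can reach near-zero error on such constant columns by predicting 0, so an averaged loss over all seven targets says little about them.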


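## **Baseline linear regression model**
The model below stacks `nn.Linear` layers with ReLU activations between them, so it is a multi-layer perceptron regressor rather than a strictly linear regression.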

In [6]:
import torch.nn as nn
import torch
class RegressionModel(nn.Module):
    #Fully connected network with several hidden layers; a fixed seed gives reproducible weight initialization
    def __init__(self, input_size=len(X.columns), hidden_sizes=(256,128,128,64,64,32), output_size=7, SEED=0):
        super(RegressionModel, self).__init__()
        if SEED is not None:
            torch.manual_seed(SEED)
        layers = []
        for hidden_size in hidden_sizes:
            layers.append(nn.Linear(input_size, hidden_size)) #fully connected layer
            layers.append(nn.ReLU()) #apply ReLU as the activation after each hidden layer
            input_size = hidden_size #the next layer's input size is this layer's output size

        layers.append(nn.Linear(input_size, output_size)) #output layer, no activation

        self.model = nn.Sequential(*layers)

    def forward(self, x):
        out = self.model(x)
        return out

model = RegressionModel(output_size=7)  #assign the dimension of the output layer = 7
Loss = nn.MSELoss() #choose the loss function (Mean Squared Error)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01) #SGD optimizer with a constant learning rate of 0.01

print(model)

RegressionModel(
  (model): Sequential(
    (0): Linear(in_features=16, out_features=256, bias=True)
    (1): ReLU()
    (2): Linear(in_features=256, out_features=128, bias=True)
    (3): ReLU()
    (4): Linear(in_features=128, out_features=128, bias=True)
    (5): ReLU()
    (6): Linear(in_features=128, out_features=64, bias=True)
    (7): ReLU()
    (8): Linear(in_features=64, out_features=64, bias=True)
    (9): ReLU()
    (10): Linear(in_features=64, out_features=32, bias=True)
    (11): ReLU()
    (12): Linear(in_features=32, out_features=7, bias=True)
  )
)
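
For comparison with the network above, a plain linear benchmark can be fitted with scikit-learn in a few lines. The sketch below uses synthetic stand-in arrays with the same shapes as X (78, 16) and Y (78, 7); the data, `X_demo`, `Y_demo`, and `linear_baseline` are assumptions for illustration, and the printed number is not a result on the real dataset.

```python
import numpy
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with the notebook's shapes (not the real features).
rng = numpy.random.default_rng(0)
X_demo = rng.normal(size=(78, 16))
true_W = rng.normal(size=(16, 7))
Y_demo = X_demo @ true_W + 0.1 * rng.normal(size=(78, 7))

# Same 4:1 split as the notebook uses.
X_tr, X_te, Y_tr, Y_te = train_test_split(X_demo, Y_demo,
                                          test_size=0.2, random_state=42)

# LinearRegression handles multiple targets directly: one weight row per target.
linear_baseline = LinearRegression().fit(X_tr, Y_tr)
baseline_mse = mean_squared_error(Y_te, linear_baseline.predict(X_te))

print("coefficient matrix shape:", linear_baseline.coef_.shape)
print(f"linear baseline test MSE: {baseline_mse:.4f}")
```

Running the same lines on the real `X` and `Y` frames would give a reference MSE against which the deeper network can be judged.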



## **Training on the debugging dataset**

In [7]:
### Split the dataset into training/testing set by a ratio 4:1
X_train,X_test,Y_train,Y_test=train_test_split(X,Y,test_size=0.2,random_state=42)
X_tensor = torch.tensor(X_train.values, dtype=torch.float32) #convert the dataset into tensor type for feeding the Pytorch model
Y_tensor = torch.tensor(Y_train.values, dtype=torch.float32) #convert the dataset into tensor type for feeding the Pytorch model
MAX_iter = 10000

for itr in range(MAX_iter):

    optimizer.zero_grad() #start with zero gradient
    outputs = model(X_tensor) #input the tensors into model to get the outputs
    target = Y_tensor #assign the comparable target data (ground truth)
    lossvalue = Loss(outputs, target)  #compute the loss
    lossvalue.backward()

    optimizer.step()
    if itr%500==0:
        print("iteration {}: loss={:.5f}".format(itr, lossvalue.item()))

iteration 0: loss=0.15529
iteration 500: loss=0.14412
iteration 1000: loss=0.14336
iteration 1500: loss=0.14272
iteration 2000: loss=0.14160
iteration 2500: loss=0.13928
iteration 3000: loss=0.13340
iteration 3500: loss=0.11460
iteration 4000: loss=0.06277
iteration 4500: loss=0.02590
iteration 5000: loss=0.01654
iteration 5500: loss=0.01338
iteration 6000: loss=0.01147
iteration 6500: loss=0.00857
iteration 7000: loss=0.00698
iteration 7500: loss=0.00577
iteration 8000: loss=0.00481
iteration 8500: loss=0.00395
iteration 9000: loss=0.00309
iteration 9500: loss=0.00252
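
Each iteration of the loop above feeds the whole training set through the model, so it performs full-batch gradient descent on the mean squared error. To make the four steps (forward pass, loss, gradient, update) concrete, the sketch below writes them by hand in NumPy for a single linear layer. The synthetic data and all variable names are assumptions for illustration; the real notebook relies on PyTorch autograd and a deeper network.

```python
import numpy

# Synthetic stand-in for the training tensors (62 rows, 16 features, 7 targets).
rng = numpy.random.default_rng(0)
X_demo = rng.normal(size=(62, 16))
W_true = rng.normal(size=(16, 7))
Y_demo = X_demo @ W_true

W = numpy.zeros((16, 7))  # weights of one linear layer
b = numpy.zeros(7)        # bias
lr = 0.01                 # same constant learning rate as the notebook
losses = []

for itr in range(2000):
    # Forward pass and residual against the ground truth.
    residual = X_demo @ W + b - Y_demo
    # Mean over all elements, matching the default reduction of nn.MSELoss.
    losses.append(float(numpy.mean(residual ** 2)))
    # Analytic gradients of the mean squared error.
    grad_W = 2.0 * X_demo.T @ residual / residual.size
    grad_b = 2.0 * residual.sum(axis=0) / residual.size
    # Gradient descent update (what optimizer.step() does for plain SGD).
    W -= lr * grad_W
    b -= lr * grad_b

print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.4f}")
```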


## **Testing on the debugging dataset**

In [8]:
X_tensor = torch.tensor(X_test.values, dtype=torch.float32) #convert the dataset into tensor type for feeding the Pytorch model
Y_tensor = torch.tensor(Y_test.values, dtype=torch.float32) #convert the dataset into tensor type for feeding the Pytorch model

with torch.no_grad():
    outputs = model(X_tensor) #input the tensors into model to get the outputs
    test_loss = Loss(outputs, Y_tensor) #compute the loss with ground truth(Y)
print(f"Test Loss: {test_loss.item()}")
print('-----------------------------------------------------------------------------------')

ground_truth_Y = pandas.DataFrame(Y_tensor.numpy(), columns=['REFUSETONSCOLLECTED','PAPERTONSCOLLECTED','MGPTONSCOLLECTED','RESORGANICSTONS','SCHOOLORGANICTONS','LEAVESORGANICTONS',	'XMASTREETONS'])
print('[Ground_truth]')
print(ground_truth_Y.head())
print('-----------------------------------------------------------------------------------')
predicted_Y = pandas.DataFrame(outputs.numpy(), columns=['REFUSETONSCOLLECTED','PAPERTONSCOLLECTED','MGPTONSCOLLECTED','RESORGANICSTONS','SCHOOLORGANICTONS','LEAVESORGANICTONS',	'XMASTREETONS'])
print('[Predicted results]')
print(predicted_Y.head())


Test Loss: 0.02568584494292736
-----------------------------------------------------------------------------------
[Ground_truth]
   REFUSETONSCOLLECTED  PAPERTONSCOLLECTED  MGPTONSCOLLECTED  RESORGANICSTONS  \
0            -0.657460                 0.0               0.0              0.0   
1            -1.482389                 0.0               0.0              0.0   
2            -0.492216                 0.0               0.0              0.0   
3             0.403858                 0.0               0.0              0.0   
4             0.065744                 0.0               0.0              0.0   

   SCHOOLORGANICTONS  LEAVESORGANICTONS  XMASTREETONS  
0                0.0                0.0           0.0  
1                0.0                0.0           0.0  
2                0.0                0.0           0.0  
3                0.0                0.0           0.0  
4                0.0                0.0           0.0  
-----------------------------------------------
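
The single test loss above averages over all seven targets, six of which are 0.0 in every ground-truth row shown. A per-target breakdown may be more informative. The sketch below computes one MSE per column with scikit-learn; the two small arrays are hypothetical illustrative values, not the notebook's predictions.

```python
import numpy
from sklearn.metrics import mean_squared_error

# Hypothetical scaled ground truth and predictions for two of the targets.
target_names = ["REFUSETONSCOLLECTED", "PAPERTONSCOLLECTED"]
y_true = numpy.array([[-0.66, 0.0], [-1.48, 0.0], [-0.49, 0.0], [0.40, 0.0]])
y_pred = numpy.array([[-0.60, 0.01], [-1.40, -0.02], [-0.55, 0.00], [0.50, 0.01]])

# multioutput="raw_values" returns one MSE per target column instead of the average.
per_target_mse = mean_squared_error(y_true, y_pred, multioutput="raw_values")
overall_mse = mean_squared_error(y_true, y_pred)

for name, mse in zip(target_names, per_target_mse):
    print(f"{name}: MSE={mse:.5f}")
print(f"average over targets: {overall_mse:.5f}")
```

In this toy case the average is pulled down by the near-constant column, which is the effect to watch for when reading the combined test loss.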