# Introduction
In Part III, we will use machine learning techniques to predict 'Occupancy'. The process goes like this: 

![MachineLearningProcess](https://uplevelsg.s3-ap-southeast-1.amazonaws.com/CommonAssets/MachineLearningProcess.png)

We include this section in all UpLevel projects, so bear with us if you've seen it before.

Generally, the machine learning process has five parts:
1. <strong>Split your data into train and test sets</strong>
2. <strong>Model creation</strong>
<br>
Import your models from sklearn and instantiate them (assign the model object to a variable)
3. <strong>Model fitting</strong>
<br>
Fit the model to your training data
4. <strong>Model prediction</strong>
<br>
Make a set of predictions using your test data
5. <strong>Model assessment</strong>
<br>
Compare your predictions with the ground truth in the test data
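
As a rough illustration of the five steps, here is a minimal sketch on synthetic data. The dataset generated by `make_classification`, the variable names, and the choice of `LogisticRegression` are assumptions made for this example only; they are not the project's data or a prescribed model.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: this is NOT the occupancy dataset)
X_demo, y_demo = make_classification(n_samples=500, n_features=5, random_state=0)

# 1. Split your data into train and test sets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)

# 2. Model creation: instantiate the model and assign it to a variable
model = LogisticRegression(max_iter=1000)

# 3. Model fitting: learn from the training data only
model.fit(X_tr, y_tr)

# 4. Model prediction: predict labels for the unseen test data
pred = model.predict(X_te)

# 5. Model assessment: compare predictions with the ground truth
score = f1_score(y_te, pred, average="macro")
print("macro f1:", score)
```

The same five calls apply to every classifier used later in this notebook; only the class instantiated in step 2 changes.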

Highly recommended readings:
1. [Important] https://scipy-lectures.org/packages/scikit-learn/index.html
2. https://machinelearningmastery.com/a-gentle-introduction-to-scikit-learn-a-python-machine-learning-library/
3. https://scikit-learn.org/stable/tutorial/basic/tutorial.html

### Step 1: Import your libraries
We will be using models from sklearn, a popular machine learning library. However, we won't import everything from sklearn; we'll take just what we need.

For now, we only need pandas to load the data. The sklearn imports come later in the notebook.

Import the following:
- pandas as pd

In [1]:
# Step 1: Import your library
import pandas as pd

### Step 2: Read the CSV from Part II as a DataFrame
Read your CSV from the previous Part as a DataFrame. 

You should have:
- 20,560 rows
- 10 columns

In [6]:
# Step 2: Read the CSV from Part II
df = pd.read_csv('/Users/sm/Desktop/Ex2.csv')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20560 entries, 0 to 20559
Data columns (total 10 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   date           20560 non-null  object 
 1   Temperature    20560 non-null  float64
 2   Humidity       20560 non-null  float64
 3   Light          20560 non-null  float64
 4   CO2            20560 non-null  float64
 5   HumidityRatio  20560 non-null  float64
 6   Occupancy      20560 non-null  int64  
 7   weekday        20560 non-null  int64  
 8   hour           20560 non-null  int64  
 9   minute         20560 non-null  int64  
dtypes: float64(5), int64(4), object(1)
memory usage: 1.6+ MB


### Step 3: Prepare your independent and dependent variables
At this point, let's prepare our independent and dependent variables. 

1. Declare a variable, and assign your independent variables to it by dropping 'date' and 'Occupancy'
2. Declare another variable, and assign only the 'Occupancy' column to it

In [13]:
# Step 3: Prepare your independent and dependent variables
X = df.drop(['date','Occupancy'],axis=1)
y = df['Occupancy']

### Step 4: Import machine learning libraries
Time to import other libraries.

The resources provided at the top of this notebook will be immensely useful if you're new to modelling. 

Import the following libraries and methods:
1. train_test_split - sklearn.model_selection
2. DummyClassifier - sklearn.dummy
3. LogisticRegression - sklearn.linear_model
4. DecisionTreeClassifier - sklearn.tree
5. RandomForestClassifier - sklearn.ensemble
6. f1_score - sklearn.metrics
7. confusion_matrix - sklearn.metrics

In [35]:
# Step 4: Import the machine learning libraries
from sklearn.model_selection import train_test_split
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, f1_score

### Step 5: Split your dataset into train and test
Now that you have finished importing the libraries you need, split the dataset into train and test sets. The code and outputs below use a 75/25 split (test_size=0.25).

Don't forget to stratify by your dependent values with the stratify parameter.
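
To see what the stratify parameter does, here is a small sketch on toy labels. The arrays `X_toy` and `y_toy` are made up for illustration (240 positives out of 1,000 rows) and are not the occupancy data. It compares the positive rate in the test set with and without stratification.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels (assumption): 760 negatives and 240 positives, i.e. 24% positive
y_toy = np.array([0] * 760 + [1] * 240)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)  # placeholder feature column

# Stratified split: class proportions are preserved in train and test
_, _, y_tr_s, y_te_s = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=0, stratify=y_toy
)

# Plain random split: class proportions can drift by chance
_, _, y_tr_r, y_te_r = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=0
)

print("overall positive rate:   ", y_toy.mean())
print("stratified test rate:    ", y_te_s.mean())
print("non-stratified test rate:", y_te_r.mean())
```

With an imbalanced target like 'Occupancy', stratifying keeps the test set representative, so scores from different models are compared on the same class mix.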

In [24]:
from collections import Counter
Counter(y)

Counter({1: 4750, 0: 15810})

In [27]:
4750/15810

0.30044275774826057

In [26]:
# Step 5: Split your dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0, stratify=y)

In [29]:
print("Train:", Counter(y_train))
print("Test:", Counter(y_test))

Train: Counter({0: 11857, 1: 3563})
Test: Counter({0: 3953, 1: 1187})


### Step 6: Train a DummyClassifier
This is what you'll need to do:
1. Start with a model
2. Declare a variable, and store your model in it (don't forget the parentheses, which instantiate the model)
3. Fit the instantiated model to your training data
4. Declare a variable that contains predictions from the model you just trained, using the test dataset (X_test)
5. Compare the prediction with the actual result (y_test) with the f1_score
6. Print a confusion_matrix of the actual y_test (rows) vs the prediction (columns)

The recommended readings will be very helpful.

Let's start with the DummyClassifier to establish a baseline. This will be useful as we train other models.

In [39]:
# Step 6a: Declare a variable to store the model
dummy_clf = DummyClassifier()
# Step 6b: Fit your train dataset
dummy_clf.fit(X_train, y_train)
# Step 6c: Declare a variable and store your predictions that you make with your model using X test data
pred_dummy = dummy_clf.predict(X_test)
# Step 6d: Print the f1_score between the y test and dummy prediction
print("f1-score: ", f1_score(y_test, pred_dummy, average='macro'))
# Step 6e: Print a confusion_matrix between y_test and your prediction
print(confusion_matrix(y_test, pred_dummy))

f1-score:  0.4347300120972177
[[3953    0]
 [1187    0]]
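
To see where the baseline score comes from, the macro f1 can be recomputed by hand from the confusion matrix printed above. This is a sketch in plain Python; the helper name `f1_for_class` is made up for this example, and it assumes the sklearn layout where rows are actual classes and columns are predicted classes.

```python
# Confusion matrix from the DummyClassifier output above
# (rows = actual class, columns = predicted class)
cm = [[3953, 0],
      [1187, 0]]


def f1_for_class(cm, k):
    """F1 for class k, computed from a square confusion matrix."""
    tp = cm[k][k]                                       # correctly predicted as k
    fp = sum(cm[i][k] for i in range(len(cm))) - tp     # predicted k, actually another class
    fn = sum(cm[k]) - tp                                # actually k, predicted another class
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


f1_class0 = f1_for_class(cm, 0)
f1_class1 = f1_for_class(cm, 1)

# Macro averaging gives each class equal weight, regardless of its size
macro_f1 = (f1_class0 + f1_class1) / 2
print(f1_class0, f1_class1, macro_f1)
```

Because the dummy model never predicts class 1, that class contributes an f1 of 0, which pulls the macro average below 0.5 even though roughly 77% of the test rows are labelled correctly.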


### Step 7: Train a LogisticRegression
Now that we have established the baseline performance of a classifier, let's train a LogisticRegression model. 

As with the DummyClassifier, train the model and then assess its performance with the f1_score and the confusion_matrix.

In [40]:
# Step 7a: Declare a variable to store the LogisticRegression model
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression()
# Step 7b: Fit your train dataset
clf.fit(X_train, y_train)
# Step 7c: Declare a variable and store your predictions that you make with your model using X test data 
pred_LR = clf.predict(X_test)
# Step 7d: Print f1_score between the y test and LogisticRegression prediction
print("f1-score: ", f1_score(y_test, pred_LR, average='macro'))
# Step 7e: Print a confusion_matrix between y_test and your prediction
print(confusion_matrix(y_test, pred_LR))

f1-score:  0.9856493188062142
[[3907   46]
 [   7 1180]]
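
The confusion matrix above can be unpacked into its four cells to read off precision and recall for the "occupied" class. This is a small sketch in plain Python using the printed numbers; the variable names are chosen for this example only.

```python
# Confusion matrix from the LogisticRegression output above
# (rows = actual class, columns = predicted class)
cm_lr = [[3907, 46],
         [7, 1180]]

# Unpack in sklearn's binary layout: [[tn, fp], [fn, tp]]
(tn, fp), (fn, tp) = cm_lr

# Precision: of the rows predicted as occupied, how many really were?
precision = tp / (tp + fp)
# Recall: of the rows that really were occupied, how many did we catch?
recall = tp / (tp + fn)
# F1 for the positive class is the harmonic mean of the two
f1_pos = 2 * precision * recall / (precision + recall)

print("false alarms:", fp, "missed occupancies:", fn)
print("precision:", precision, "recall:", recall, "f1 (class 1):", f1_pos)
```

Read this way, the model's errors are mostly false alarms (46) rather than missed occupancies (7).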


### Step 8: Train a DecisionTreeClassifier
The LogisticRegression model should perform quite impressively, based on the confusion matrix and the f1_score. 

Can we improve it further? Let's find out by training and assessing a DecisionTreeClassifier.

In [41]:
# Step 8: Train a DecisionTreeClassifier
from sklearn.tree import DecisionTreeClassifier
clf1 = DecisionTreeClassifier()
clf1.fit(X_train, y_train)
pred_DT = clf1.predict(X_test)
print("f1-score: ", f1_score(y_test, pred_DT, average='macro'))
print(confusion_matrix(y_test, pred_DT))

f1-score:  0.9887285277624855
[[3939   14]
 [  27 1160]]


### Step 9: Train a RandomForestClassifier
The DecisionTreeClassifier is most likely (slightly) better than the LogisticRegression results, in terms of f1 score.

Train a RandomForestClassifier and see if you can push the performance even further.

In [43]:
# Step 9: Train a RandomForestClassifier
from sklearn.ensemble import RandomForestClassifier
clf2 = RandomForestClassifier()
clf2.fit(X_train, y_train)
pred_RF = clf2.predict(X_test)
print("f1-score: ", f1_score(y_test, pred_RF, average='macro'))
print(confusion_matrix(y_test, pred_RF))

f1-score:  0.9915078953815775
[[3938   15]
 [  16 1171]]


### Optional: Train other classifiers
There are a few other classifiers that you can try, apart from the three that we used above.

It's hard to top RandomForestClassifier for this dataset, but it's still worth typing it out to get some practice in.

In [None]:
# Optional: Try other classifiers

### Step 10: Get a feature importances DataFrame
Create a DataFrame containing the feature importances of your best performing model. 

For example, this is what the DataFrame could look like:

![RandomForestClassifierFeatureImportances](https://uplevelsg.s3.ap-southeast-1.amazonaws.com/ProjectRoomOccupancy/RandomForestClassifierFeatureImportances.png)

What's the most important feature? 

Does it align with what you observed in Part II? 

In [51]:
# Step 10: Create a DataFrame containing feature importances
pd.DataFrame({'feature':X_train.columns, 'importance':clf2.feature_importances_})

|  | feature | importance |
|---|---|---|
| 0 | Temperature | 0.178135 |
| 1 | Humidity | 0.019396 |
| 2 | Light | 0.533323 |
| 3 | CO2 | 0.097639 |
| 4 | HumidityRatio | 0.028111 |
| 5 | weekday | 0.048848 |
| 6 | hour | 0.085892 |
| 7 | minute | 0.008656 |
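
Sorting the importances makes the ranking easier to read. The sketch below rebuilds the DataFrame from the values printed above (copied by hand, so the names `importances` and `ranked` are just for this example) and sorts it in descending order.

```python
import pandas as pd

# Values copied from the feature importances output above
importances = pd.DataFrame({
    "feature": ["Temperature", "Humidity", "Light", "CO2",
                "HumidityRatio", "weekday", "hour", "minute"],
    "importance": [0.178135, 0.019396, 0.533323, 0.097639,
                   0.028111, 0.048848, 0.085892, 0.008656],
})

# Highest importance first, with a fresh 0..n-1 index
ranked = importances.sort_values("importance", ascending=False).reset_index(drop=True)
print(ranked)

# Importances from a random forest are normalised, so they should sum to about 1
print("total:", ranked["importance"].sum())
```

In your own notebook, the same `.sort_values("importance", ascending=False)` call can be chained directly onto the DataFrame built from `feature_importances_`.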


## Modelling without 'Light'
Whichever model you used, it's most likely that you identified "Light" as the most important feature in the model.

This makes sense, because if there's 'Light', it's most likely that there's someone in the room. 

Here's a challenge - let's try modelling without "Light" as a feature. 

### Step 11: Repeat Step 3 and drop 'Light'
Repeat what you did in Step 3, i.e. prepare independent and dependent values.

However, this time drop 'Light' on top of 'date' and 'Occupancy' to prepare your independent values.

In [53]:
# Step 11: Prepare new independent and dependent values
X1 = df.drop(['date','Occupancy','Light'],axis=1)
y1 = df['Occupancy']

### Step 12: Repeat Steps 6-10
Now that you've removed the 'Light' column, it's time to split your data and model again.

One thing to note - when you train a LogisticRegression you <strong>may</strong> receive a warning. Don't worry - just increase the value of max_iter.
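
Since Steps 6 to 9 repeat the same fit, predict, and score pattern, one option is to wrap it in a small helper and loop over the models. This is only a sketch: the function name `train_and_assess`, the synthetic data from `make_classification`, and the hyperparameters are assumptions for illustration, not part of the project instructions.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def train_and_assess(model, X_train, X_test, y_train, y_test):
    """Fit the model, predict on the test set, return (macro f1, confusion matrix)."""
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    # zero_division=0 avoids a warning when a class is never predicted
    score = f1_score(y_test, pred, average="macro", zero_division=0)
    return score, confusion_matrix(y_test, pred)


# Synthetic stand-in data (assumption); swap in your own train/test split
X_demo, y_demo = make_classification(n_samples=600, n_features=6, random_state=0)
split = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo)

models = {
    "Dummy": DummyClassifier(strategy="most_frequent"),
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "DecisionTree": DecisionTreeClassifier(random_state=0),
    "RandomForest": RandomForestClassifier(random_state=0),
}

results = {}
for name, model in models.items():
    # split unpacks as X_train, X_test, y_train, y_test
    score, cm = train_and_assess(model, *split)
    results[name] = score
    print(name, round(score, 4))
    print(cm)
```

Typing each model out by hand, as this notebook asks, is still good practice; the helper is mainly useful once the pattern feels familiar.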

In [54]:
# Step 12a: Split your data into train and test
X_train1, X_test1, y_train1, y_test1 = train_test_split(X1, y1, test_size=0.25, random_state=0, stratify=y1)

In [57]:
# Step 12b: Train and assess a DummyClassifier
dummy_clf1 = DummyClassifier()
dummy_clf1.fit(X_train1, y_train1)
pred_dummy1 = dummy_clf1.predict(X_test1)
print("f1-score: ", f1_score(y_test1, pred_dummy1, average='macro'))
print(confusion_matrix(y_test1, pred_dummy1))

f1-score:  0.4347300120972177
[[3953    0]
 [1187    0]]


In [58]:
# Step 12b: Train and assess a LogisticRegression
from sklearn.linear_model import LogisticRegression
clf_a = LogisticRegression()
clf_a.fit(X_train1, y_train1)
pred_LR1 = clf_a.predict(X_test1)
print("f1-score: ", f1_score(y_test1, pred_LR1, average='macro'))
print(confusion_matrix(y_test1, pred_LR1))

f1-score:  0.7826083338655963
[[3701  252]
 [ 485  702]]


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
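
The warning above suggests two remedies: raising max_iter or scaling the data. Here is a minimal sketch of both, combined in a pipeline. The synthetic data and the deliberately inflated first feature are assumptions for illustration; on the occupancy data you would fit the pipeline on your own training split instead.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption), with one feature on a much larger
# scale than the others, similar to CO2 next to HumidityRatio
X_demo, y_demo = make_classification(n_samples=500, n_features=5, random_state=0)
X_demo[:, 0] *= 1000.0

# StandardScaler rescales every feature to zero mean and unit variance,
# then LogisticRegression is given a higher iteration budget
scaled_lr = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_lr.fit(X_demo, y_demo)

# make_pipeline names each step after its lowercased class name
n_iter = scaled_lr.named_steps["logisticregression"].n_iter_[0]
print("iterations used:", n_iter)
```

Putting the scaler inside the pipeline means it is fitted only on whatever data is passed to `fit`, so the test set does not leak into the scaling step.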


In [59]:
# Step 12c: Train a DecisionTreeClassifier
from sklearn.tree import DecisionTreeClassifier
clf1_a = DecisionTreeClassifier()
clf1_a.fit(X_train1, y_train1)
pred_DT1 = clf1_a.predict(X_test1)
print("f1-score: ", f1_score(y_test1, pred_DT1, average='macro'))
print(confusion_matrix(y_test1, pred_DT1))

f1-score:  0.9832501512502687
[[3927   26]
 [  35 1152]]


In [60]:
# Step 12d: Train a RandomForestClassifier
from sklearn.ensemble import RandomForestClassifier
clf2_a = RandomForestClassifier()
clf2_a.fit(X_train1, y_train1)
pred_RF1 = clf2_a.predict(X_test1)
print("f1-score: ", f1_score(y_test1, pred_RF1, average='macro'))
print(confusion_matrix(y_test1, pred_RF1))

f1-score:  0.9882136741111288
[[3933   20]
 [  23 1164]]


In [62]:
# Step 12e: Create a DataFrame containing feature importances
pd.DataFrame({'feature':X_train1.columns, 'importance':clf2_a.feature_importances_})

|  | feature | importance |
|---|---|---|
| 0 | Temperature | 0.279353 |
| 1 | Humidity | 0.063016 |
| 2 | CO2 | 0.211105 |
| 3 | HumidityRatio | 0.074404 |
| 4 | weekday | 0.105689 |
| 5 | hour | 0.241689 |
| 6 | minute | 0.024744 |


<details>
    <summary><strong>Did removing 'Light' affect model performance adversely?</strong></summary>
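    <div>For the tree-based models, not really: the RandomForestClassifier's macro f1 score only slips from about 0.992 to 0.988, and the DecisionTreeClassifier's from about 0.989 to 0.983. The LogisticRegression is the exception, dropping from about 0.986 to 0.783 (it also hit the iteration limit).</div>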
</details>

<details>
    <summary><strong>What were the features that were important?</strong></summary>
    <div>In the new DataFrame, the most important features were 'Temperature' (0.28), 'hour' (0.24), and 'CO2' (0.21). It seems that the model leaned on these three features in the absence of 'Light'.</div>
</details>
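
To put numbers on the first question, the macro f1 scores printed earlier can be compared side by side. The sketch below copies those scores by hand into a dictionary (the names `scores` and `drops` are just for this example) and computes how much each model lost when 'Light' was removed.

```python
# Macro f1 scores copied from the outputs above: (with 'Light', without 'Light')
scores = {
    "DummyClassifier": (0.4347300120972177, 0.4347300120972177),
    "LogisticRegression": (0.9856493188062142, 0.7826083338655963),
    "DecisionTreeClassifier": (0.9887285277624855, 0.9832501512502687),
    "RandomForestClassifier": (0.9915078953815775, 0.9882136741111288),
}

# Drop in macro f1 for each model after removing 'Light'
drops = {
    name: with_light - without_light
    for name, (with_light, without_light) in scores.items()
}

# Print the models from largest to smallest drop
for name, drop in sorted(drops.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<24} drop in macro f1: {drop:.4f}")

largest_drop = max(drops, key=drops.get)
print("largest drop:", largest_drop)
```

The tree-based models barely move, while the linear model loses the most, which suggests the remaining features carry the signal in a way that is harder for a linear decision boundary to use.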

# The end
You did it! You've arrived at the end. Congratulations and well done on completing this project series! 

Let's review.
1. In Part I, you collected the datasets and combined them to form a single DataFrame. You also investigated the data briefly to see if there was anything remarkable about it
2. In Part II, you performed exploratory data analysis on the dataset, investigating distributions and relationships found between features. You also engineered additional features from the dataset for model building
3. In Part III, you trained a machine learning model that can predict room occupancy based on sensor data. In addition, you modelled the problem without a major feature to see if the model performed equally well

Go on, give yourself a pat on the back. We hope this project series has given you more confidence in coding and machine learning. 

What you've learned here is just the tip of the iceberg, and a launchpad for bigger and better things to come. Come join us in our Telegram community over at https://bit.ly/UpLevelSG and our Facebook page at https://fb.com/UpLevelSG