# Description
The task is to perform an exploratory data analysis on the [balance-scale dataset](https://archive.ics.uci.edu/ml/datasets/balance+scale).

## About the dataset
This data set was generated to model psychological experimental results. Each example is classified as having the balance scale tip to the right, tip to the left, or be balanced. The attributes are the left weight, the left distance, the right weight, and the right distance. The class is found by comparing (left-distance * left-weight) with (right-distance * right-weight): the scale tips toward the side with the greater product. If the two products are equal, it is balanced.
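
The rule can be written as a small function. This is a minimal sketch; the name `balance_class` and its argument names are hypothetical and are not used elsewhere in this notebook.

```python
def balance_class(left_weight, left_distance, right_weight, right_distance):
    """Return 'L', 'R', or 'B' by comparing the two weight * distance products."""
    left = left_weight * left_distance
    right = right_weight * right_distance
    if left > right:
        return "L"
    if right > left:
        return "R"
    return "B"

print(balance_class(1, 1, 1, 1))  # equal products
print(balance_class(1, 1, 1, 2))  # right product is larger
print(balance_class(3, 2, 1, 5))  # left product is larger
```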

### Attribute information
+ Class Name: 3 (L, B, R)
+ Left-Weight: 5 (1, 2, 3, 4, 5)
+ Left-Distance: 5 (1, 2, 3, 4, 5)
+ Right-Weight: 5 (1, 2, 3, 4, 5)
+ Right-Distance: 5 (1, 2, 3, 4, 5)
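
Since each of the four attributes takes five values, the full grid has 5^4 = 625 combinations. The sketch below enumerates that grid and counts the classes, assuming the data file contains every combination exactly once; the helper `label` is hypothetical.

```python
from collections import Counter
from itertools import product

values = range(1, 6)  # every attribute ranges over 1..5

def label(lw, ld, rw, rd):
    # Hypothetical helper: compare the left and right products
    left, right = lw * ld, rw * rd
    return "L" if left > right else "R" if right > left else "B"

# Column order follows the notebook: LW, LD, RW, RD
rows = list(product(values, repeat=4))
class_counts = Counter(label(*row) for row in rows)

print(len(rows))
print(class_counts)
```

Under that assumption the balanced class is much rarer than the other two (49 of 625 rows against 288 each), which is worth keeping in mind when reading per-class scores.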

In [33]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
%matplotlib inline
df = pd.read_csv('./balance-scale.data')

In [34]:
df.head()

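|   | Res | LW | LD | RW | RD |
|---|-----|----|----|----|----|
| 0 | B | 1 | 1 | 1 | 1 |
| 1 | R | 1 | 1 | 1 | 2 |
| 2 | R | 1 | 1 | 1 | 3 |
| 3 | R | 1 | 1 | 1 | 4 |
| 4 | R | 1 | 1 | 1 | 5 |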


In [35]:
#  Generate the x values
x = df.drop(['Res'], axis=1)
x.head()

|   | LW | LD | RW | RD |
|---|----|----|----|----|
| 0 | 1 | 1 | 1 | 1 |
| 1 | 1 | 1 | 1 | 2 |
| 2 | 1 | 1 | 1 | 3 |
| 3 | 1 | 1 | 1 | 4 |
| 4 | 1 | 1 | 1 | 5 |


In [36]:
# Generate the y values
y = df['Res']
y.head()

0    B
1    R
2    R
3    R
4    R
Name: Res, dtype: object

In [37]:
# check for any null data in x
x.isnull().any()

LW    False
LD    False
RW    False
RD    False
dtype: bool

In [38]:
# check for any null data in y
y.isnull().any()

False

In [39]:
# Add torque as a parameter in a new dataframe
x_new = pd.DataFrame()
x_new['LT'] = x['LW']*x['LD']
x_new['RT'] = x['RW']*x['RD']
x_new.head()

|   | LT | RT |
|---|----|----|
| 0 | 1 | 1 |
| 1 | 1 | 2 |
| 2 | 1 | 3 |
| 3 | 1 | 4 |
| 4 | 1 | 5 |


In [40]:
# Convert balanced, left, right results to integral values
y = y.map(dict(B=0, L=1, R=2))
y.head()

0    0
1    2
2    2
3    2
4    2
Name: Res, dtype: int64

# Using the weight and distance parameters

Split the dataset in a 70:30 ratio using sklearn's built-in train_test_split function, so that the model's accuracy can be measured on data it was not trained on.
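
Stratifying the split means each class keeps roughly the same share in both parts. The following is a plain-Python illustration of that idea, not sklearn's implementation. `stratified_split` is a hypothetical helper, and the class counts (49 balanced, 288 left, 288 right) assume the complete 625-row dataset.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size=0.3, seed=42):
    """Return (train_idx, test_idx) with each class split in the same ratio."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                       # shuffle within the class
        n_test = round(len(indices) * test_size)   # per-class test share
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Assumed class counts for the full dataset
labels = ["B"] * 49 + ["L"] * 288 + ["R"] * 288
train_idx, test_idx = stratified_split(labels)
test_counts = Counter(labels[i] for i in test_idx)
print(test_counts)
```

The exact counts produced by `train_test_split` can differ by a row, because it allocates the rounded total across classes rather than rounding each class separately.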

In [41]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(x,y,stratify=y, test_size=0.3, random_state=42)

In [42]:
X_train.describe()

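|       | LW | LD | RW | RD |
|-------|----|----|----|----|
| count | 437.0 | 437.0 | 437.0 | 437.0 |
| mean | 2.983982 | 3.086957 | 3.050343 | 3.045767 |
| std | 1.388756 | 1.403384 | 1.427846 | 1.408595 |
| min | 1.0 | 1.0 | 1.0 | 1.0 |
| 25% | 2.0 | 2.0 | 2.0 | 2.0 |
| 50% | 3.0 | 3.0 | 3.0 | 3.0 |
| 75% | 4.0 | 4.0 | 4.0 | 4.0 |
| max | 5.0 | 5.0 | 5.0 | 5.0 |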


In [43]:
from sklearn.tree import DecisionTreeClassifier

In [44]:
# Use GridSearchCV to search over the split criterion and max_depth

from sklearn.model_selection import GridSearchCV
tree_para = {'criterion':['gini','entropy'], 'max_depth':[4,5,6,7,8,9,10,11]}
dt_model_grid = GridSearchCV(DecisionTreeClassifier(random_state=31), tree_para, cv=5)

In [45]:
dt_model_grid.fit(X_train, y_train)

GridSearchCV(cv=5, estimator=DecisionTreeClassifier(random_state=31),
             param_grid={'criterion': ['gini', 'entropy'],
                         'max_depth': [4, 5, 6, 7, 8, 9, 10, 11]})

In [46]:
dt_model = dt_model_grid.best_estimator_

In [47]:
# Scoring the model
from sklearn.metrics import classification_report
y_pred1 = dt_model.predict(X_test)
print(classification_report(y_test,y_pred1, target_names=["Balanced","Left","Right"]))

              precision    recall  f1-score   support

    Balanced       0.12      0.27      0.16        15
        Left       0.89      0.83      0.86        87
       Right       0.90      0.77      0.83        86

    accuracy                           0.76       188
   macro avg       0.64      0.62      0.62       188
weighted avg       0.83      0.76      0.79       188



In [48]:
# Plot the tree

from sklearn.tree import export_graphviz
export_graphviz( 
 dt_model,
 out_file=("model1.dot"),
 feature_names=['Left Weight','Left Distance','Right Weight','Right Distance'],
 class_names=['Balanced','Left','Right'],
 filled=True,
)

#  run this to make png
#  dot -Tpng model1.dot -o model1.png

# Using the created torque

In [49]:
dt_model2 = DecisionTreeClassifier(random_state=31)
X_train, X_test, y_train, y_test = train_test_split(x_new,y,stratify=y, test_size=0.3, random_state=42)

In [50]:
X_train.head()

|     | LT | RT |
|-----|----|----|
| 16 | 1 | 8 |
| 387 | 4 | 9 |
| 417 | 8 | 12 |
| 206 | 8 | 4 |
| 414 | 8 | 15 |


In [51]:
X_train.shape

(437, 2)

In [52]:
dt_model2.fit(X_train,y_train)

DecisionTreeClassifier(random_state=31)

In [53]:
y_pred2 = dt_model2.predict(X_test)
print(classification_report(y_test,y_pred2, target_names=["Balanced","Left","Right"]))

              precision    recall  f1-score   support

    Balanced       0.88      1.00      0.94        15
        Left       1.00      1.00      1.00        87
       Right       1.00      0.98      0.99        86

    accuracy                           0.99       188
   macro avg       0.96      0.99      0.98       188
weighted avg       0.99      0.99      0.99       188



In [54]:
# Plot the tree

from sklearn.tree import export_graphviz
export_graphviz( 
 dt_model2,
 out_file=("model2.dot"),
 feature_names=['Left Torque','Right Torque'],
 class_names=['Balanced','Left','Right'],
 filled=True,
)

#  run this to make png
#  dot -Tpng model2.dot -o model2.png

# Even more optimized?
After looking at the trees, I can see that the model is not taking the difference between the two torques into account as I had hoped, which results in the slightly lower accuracy.

To fix this, adding the difference outright as a feature might help, though the gap could just as well be due to differences between real life and theoretical physics formulas.
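
One way to see why an explicit difference feature could help: a decision tree splits on one feature at a time, so with only LT and RT it has to approximate the comparison LT > RT with many separate thresholds, whereas on LT - RT two thresholds are enough. The sketch below checks this over every attribute combination from 1 to 5; the function names are hypothetical.

```python
from itertools import product

def true_class(lt, rt):
    # Ground-truth rule: the larger torque wins, equal torques balance
    return "L" if lt > rt else "R" if rt > lt else "B"

def class_from_diff(diff):
    # Two thresholds on a single feature, the kind of split a tree can make
    if diff <= -0.5:
        return "R"
    if diff <= 0.5:
        return "B"
    return "L"

mismatches = 0
for lw, ld, rw, rd in product(range(1, 6), repeat=4):
    lt, rt = lw * ld, rw * rd
    if class_from_diff(lt - rt) != true_class(lt, rt):
        mismatches += 1

print(mismatches)  # 0: the sign of the difference decides every case
```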

In [55]:
x_new['diff'] = x_new['LT']-x_new['RT']
x_new.head()

|   | LT | RT | diff |
|---|----|----|------|
| 0 | 1 | 1 | 0 |
| 1 | 1 | 2 | -1 |
| 2 | 1 | 3 | -2 |
| 3 | 1 | 4 | -3 |
| 4 | 1 | 5 | -4 |


In [56]:
X_train, X_test, y_train, y_test = train_test_split(x_new,y,stratify=y, test_size=0.3, random_state=42)

In [57]:
dt_model3 = DecisionTreeClassifier(random_state=42)
dt_model3.fit(X_train,y_train)

DecisionTreeClassifier(random_state=42)

In [58]:
# Create classification report
y_pred3 = dt_model3.predict(X_test)
print(classification_report(y_test,y_pred3, target_names=["Balanced","Left","Right"]))

              precision    recall  f1-score   support

    Balanced       1.00      1.00      1.00        15
        Left       1.00      1.00      1.00        87
       Right       1.00      1.00      1.00        86

    accuracy                           1.00       188
   macro avg       1.00      1.00      1.00       188
weighted avg       1.00      1.00      1.00       188



In [59]:
# Plot the tree

from sklearn.tree import export_graphviz
export_graphviz( 
 dt_model3,
 out_file=("model3.dot"),
 feature_names=['Left Torque','Right Torque','Difference'],
 class_names=['Balanced','Left','Right'],
 filled=True,
)

#  run this to make png
#  dot -Tpng model3.dot -o model3.png

# Final Conclusion
The model achieved a 'perfect' score once the difference feature was added, as expected.

## Create pickle files of the final models for deployment

In [60]:
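import pickle

# Save each fitted model; the with-block closes the file once it is written
models = {'model3.pkl': dt_model3, 'model2.pkl': dt_model2, 'model1.pkl': dt_model}
for filename, model in models.items():
    with open(filename, 'wb') as f:
        pickle.dump(model, f)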
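
For deployment, the saved file has to be read back with `pickle.load`. The following is a minimal round-trip sketch. `save_model` and `load_model` are hypothetical helpers, and a plain dictionary stands in for the fitted classifier so that the example runs on its own.

```python
import os
import pickle
import tempfile

def save_model(model, path):
    # The with-block closes the file even if pickling fails
    with open(path, "wb") as f:
        pickle.dump(model, f)

def load_model(path):
    # Only unpickle files you trust: pickle can execute code on load
    with open(path, "rb") as f:
        return pickle.load(f)

# Placeholder standing in for a fitted classifier such as dt_model3
placeholder = {"name": "model3", "features": ["LT", "RT", "diff"]}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model3.pkl")
    save_model(placeholder, path)
    restored = load_model(path)

print(restored == placeholder)
```

With a real model, the loaded object's `predict` method expects the same columns, in the same order, that the model was trained on (`LT`, `RT`, `diff` for the third model).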