# Description

The task is to perform an exploratory data analysis of the balance-scale dataset.


## Data Set Information

This data set was generated to model psychological experimental results. Each example is classified as having the balance scale tip to the right, tip to the left, or be balanced. The attributes are the left weight, the left distance, the right weight, and the right distance. The class is found by comparing (left-distance × left-weight) with (right-distance × right-weight): the scale tips toward the side with the greater product. If the two products are equal, it is balanced.
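
The rule above can be written as a few lines of code. This is a minimal sketch, not part of the original notebook: `balance_class` is a hypothetical helper name, and the enumeration assumes each attribute takes the values 1 to 5 as listed in the attribute information.

```python
from collections import Counter
from itertools import product


def balance_class(lw, ld, rw, rd):
    """Return 'L', 'R' or 'B' by comparing the left and right products."""
    left, right = lw * ld, rw * rd
    if left > right:
        return "L"
    if right > left:
        return "R"
    return "B"


# Enumerate every combination of the four attributes (values 1..5 each)
# and count how often each class occurs.
counts = Counter(balance_class(*row) for row in product(range(1, 6), repeat=4))
print(counts)
```

The counts make the class imbalance visible: balanced examples are much rarer than left or right ones.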

### Attribute Information

1. Class Name: 3 (L, B, R)
2. Left-Weight: 5 (1, 2, 3, 4, 5)
3. Left-Distance: 5 (1, 2, 3, 4, 5)
4. Right-Weight: 5 (1, 2, 3, 4, 5)
5. Right-Distance: 5 (1, 2, 3, 4, 5)

In [1]:
#importing libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline


In [10]:
#reading the data
data=pd.read_csv('balance-scale.data')

In [11]:
#shape of the data
data.shape

(625, 5)

In [12]:
#first five rows of the data
data.head()

| | Class | LW | LD | RW | RD |
|---|---|---|---|---|---|
| 0 | B | 1 | 1 | 1 | 1 |
| 1 | R | 1 | 1 | 1 | 2 |
| 2 | R | 1 | 1 | 1 | 3 |
| 3 | R | 1 | 1 | 1 | 4 |
| 4 | R | 1 | 1 | 1 | 5 |
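
The output above shows named columns, so the file read here apparently already has a header row. The raw UCI `balance-scale.data` file has none, and `read_csv` with its defaults would then consume the first record as column names. A minimal sketch of reading header-less data, assuming the raw layout (class label first, then the four attributes); the three in-memory rows stand in for the file:

```python
import io

import pandas as pd

# Three rows in the raw layout, used here in place of the real file.
raw = io.StringIO("B,1,1,1,1\nR,1,1,1,2\nL,5,5,1,1\n")

# Column names as used in this notebook.
columns = ["Class", "LW", "LD", "RW", "RD"]

# header=None keeps the first record as data; names supplies the labels.
sample = pd.read_csv(raw, header=None, names=columns)
print(sample.shape)
print(sample.head())
```

With a real file, the `io.StringIO` object would be replaced by the file path.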


In [13]:
#Generating the x values
x=data.drop(['Class'],axis=1)

In [14]:
x.head()

| | LW | LD | RW | RD |
|---|---|---|---|---|
| 0 | 1 | 1 | 1 | 1 |
| 1 | 1 | 1 | 1 | 2 |
| 2 | 1 | 1 | 1 | 3 |
| 3 | 1 | 1 | 1 | 4 |
| 4 | 1 | 1 | 1 | 5 |


In [15]:
#Generating the y values
y=data['Class']
y.head()

0    B
1    R
2    R
3    R
4    R
Name: Class, dtype: object

In [16]:
#Checking for any null data in x
x.isnull().any()

LW    False
LD    False
RW    False
RD    False
dtype: bool

In [17]:
#Checking for any null data in y
y.isnull().any()

False

In [18]:
#Adding left and right torque as a new data frame
x1=pd.DataFrame()
x1['LT']=x['LW']*x['LD']
x1['RT']=x['RW']*x['RD']
x1.head()

| | LT | RT |
|---|---|---|
| 0 | 1 | 1 |
| 1 | 1 | 2 |
| 2 | 1 | 3 |
| 3 | 1 | 4 |
| 4 | 1 | 5 |


In [19]:
#Converting the values of the "Class" attribute, i.e., Balanced (B), Left (L) and Right (R), to numerical values for computation in sklearn
y=y.map(dict(B=0,L=1,R=2))
y.head()

0    0
1    2
2    2
3    2
4    2
Name: Class, dtype: int64

## Using the Weight and Distance parameters

We split the data set 70:30 with sklearn's built-in `train_test_split` function, so that the accuracy is measured on data the model has not seen during training. The `stratify=y` argument keeps the class proportions the same in both parts.
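
The effect of a stratified split can be checked directly. The sketch below is not part of the original notebook. It assumes the data set contains every combination of the four attributes (which matches the 625 rows reported above) and rebuilds the labels from the torque rule; `rows`, `labels`, and `share` are illustrative names.

```python
from itertools import product

import numpy as np
from sklearn.model_selection import train_test_split

# Rebuild the 625 attribute combinations and their labels from the torque rule.
rows = np.array(list(product(range(1, 6), repeat=4)))
lt = rows[:, 0] * rows[:, 1]
rt = rows[:, 2] * rows[:, 3]
labels = np.where(lt > rt, "L", np.where(lt < rt, "R", "B"))

# Same split settings as in the notebook cell that follows.
X_tr, X_te, y_tr, y_te = train_test_split(
    rows, labels, stratify=labels, test_size=0.3, random_state=2
)


def share(values, cls):
    """Fraction of entries in values equal to cls."""
    return float(np.mean(values == cls))


# Class shares in the full data, the training part, and the test part.
for cls in ("B", "L", "R"):
    print(cls, round(share(labels, cls), 3),
          round(share(y_tr, cls), 3), round(share(y_te, cls), 3))
```

The three shares per class should be nearly identical, which matters here because the balanced class is small.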

In [20]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(x,y,stratify=y, test_size=0.3, random_state=2)

In [21]:
X_train.describe()

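| | LW | LD | RW | RD |
|---|---|---|---|---|
| count | 437.0 | 437.0 | 437.0 | 437.0 |
| mean | 2.95881 | 3.059497 | 3.016018 | 3.006865 |
| std | 1.431348 | 1.437101 | 1.432653 | 1.400344 |
| min | 1.0 | 1.0 | 1.0 | 1.0 |
| 25% | 2.0 | 2.0 | 2.0 | 2.0 |
| 50% | 3.0 | 3.0 | 3.0 | 3.0 |
| 75% | 4.0 | 4.0 | 4.0 | 4.0 |
| max | 5.0 | 5.0 | 5.0 | 5.0 |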


In [35]:
#Importing the decision tree classifier and creating its object
from sklearn.tree import DecisionTreeClassifier
clf= DecisionTreeClassifier()

In [37]:
clf.fit(X_train,y_train)

DecisionTreeClassifier()

In [38]:
y_pred=clf.predict(X_test)

In [39]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test,y_pred)

0.7446808510638298

The accuracy score is fairly low (about 0.74), so we look for better hyperparameters. We do that with GridSearchCV.

In [40]:
#Using GridSearchCV to find the best criterion and maximum depth
from sklearn.model_selection import GridSearchCV
tree_para={"criterion":["gini","entropy"], "max_depth":[3,4,5,6,7,8,9,10,11,12]}
dt_model_grid= GridSearchCV(DecisionTreeClassifier(random_state=3),tree_para, cv=10)

In [41]:
dt_model_grid.fit(X_train,y_train)

GridSearchCV(cv=10, estimator=DecisionTreeClassifier(random_state=3),
             param_grid={'criterion': ['gini', 'entropy'],
                         'max_depth': [3, 4, 5, 6, 7, 8, 9, 10, 11, 12]})

In [45]:
# To print the optimum parameters computed by GridSearchCV required for best accuracy score
dt_model=dt_model_grid.best_estimator_
print(dt_model)

DecisionTreeClassifier(criterion='entropy', max_depth=5, random_state=3)


In [47]:
#Best mean cross-validated accuracy over all parameter combinations provided
dt_model_grid.best_score_

0.8193446088794925

In [48]:
dt_model_grid.best_params_

{'criterion': 'entropy', 'max_depth': 5}

In [59]:
#Scoring the model
from sklearn.metrics import classification_report
y_pred1=dt_model.predict(X_test)
print(classification_report(y_test,y_pred1,target_names=["Balanced","Left","Right"]))

              precision    recall  f1-score   support

    Balanced       0.09      0.07      0.08        15
        Left       0.75      0.83      0.79        87
       Right       0.81      0.77      0.79        86

    accuracy                           0.74       188
   macro avg       0.55      0.55      0.55       188
weighted avg       0.73      0.74      0.73       188



In [60]:
from sklearn import tree


In [61]:
!pip install graphviz

Collecting graphviz
  Downloading graphviz-0.17-py3-none-any.whl (18 kB)
Installing collected packages: graphviz
Successfully installed graphviz-0.17


In [None]:
#Plotting the Tree
from sklearn.tree import export_graphviz
export_graphviz(
dt_model,
out_file=("model1.dot"),
feature_names=["Left Weight","Left Distance","Right Weight","Right Distance"],
class_names=["Balanced","Left","Right"],
filled=True)

#Run this to print png
#  !dot -Tpng model1.dot -o model1.png
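
Note that `pip install graphviz` installs only the Python bindings; the `dot` command comes from the separate Graphviz system package. Where that binary is not available, scikit-learn's `plot_tree` can draw the tree with matplotlib instead. A minimal sketch on a toy four-row training set (`X_demo`, `y_demo`, and `demo_tree` are illustrative names, not part of the notebook); in the notebook one would pass `dt_model` instead of `demo_tree`:

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the sketch also runs without a display
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier, plot_tree

# Toy data standing in for X_train / y_train (B=0, L=1, R=2).
X_demo = [[1, 1, 1, 1], [5, 5, 1, 1], [1, 1, 5, 5], [2, 3, 3, 2]]
y_demo = [0, 1, 2, 0]
demo_tree = DecisionTreeClassifier(random_state=3).fit(X_demo, y_demo)

# Draw the fitted tree on a matplotlib axis.
fig, ax = plt.subplots(figsize=(8, 5))
artists = plot_tree(
    demo_tree,
    feature_names=["Left Weight", "Left Distance", "Right Weight", "Right Distance"],
    class_names=["Balanced", "Left", "Right"],
    filled=True,
    ax=ax,
)
# fig.savefig("model1.png")  # uncomment to write the image to disk
```

Inside a notebook with `%matplotlib inline`, the `matplotlib.use("Agg")` line can be dropped and the figure appears below the cell.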


## Using the created Torque

In [71]:
dt_model2 = DecisionTreeClassifier(random_state=31)
X_train, X_test, y_train, y_test= train_test_split(x1,y, stratify=y, test_size=0.3, random_state=8)

In [72]:
X_train.head()

| | LT | RT |
|---|---|---|
| 153 | 4 | 4 |
| 258 | 3 | 8 |
| 310 | 9 | 3 |
| 515 | 5 | 4 |
| 619 | 25 | 20 |


In [73]:
X_train.shape

(437, 2)

In [74]:
dt_model2.fit(X_train, y_train)

DecisionTreeClassifier(random_state=31)

In [76]:
y_pred2= dt_model2.predict(X_test)
print(classification_report(y_test, y_pred2, target_names=["Balanced","Left","Right"]))

              precision    recall  f1-score   support

    Balanced       0.65      0.73      0.69        15
        Left       1.00      1.00      1.00        86
       Right       0.95      0.93      0.94        87

    accuracy                           0.95       188
   macro avg       0.87      0.89      0.88       188
weighted avg       0.95      0.95      0.95       188



In [None]:
#Plotting the Tree
from sklearn.tree import export_graphviz
export_graphviz(
dt_model2,
out_file=("model2.dot"),
feature_names=["Left Torque", "Right Torque"],
class_names=["Balanced","Left","Right"],
filled=True)

#Run this to make the png
#  !dot -Tpng model2.dot -o model2.png

## Improving the accuracy further

After observing the trees, we conclude that the difference between the two torques is not being taken into account: each split compares a single torque against a threshold. Hence, we add the difference as an attribute to try to increase the accuracy.
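
To see why the difference helps, note that the class depends only on the sign of left torque minus right torque, so two threshold tests on that single value are enough. The sketch below is not part of the original notebook. It assumes the data set contains every combination of the four attributes, rebuilds the labels from the torque rule, and fits a depth-2 tree on the difference alone; `stump` and the other names are illustrative.

```python
from itertools import product

import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Rebuild all attribute combinations and the two torques.
rows = np.array(list(product(range(1, 6), repeat=4)))
lt = rows[:, 0] * rows[:, 1]
rt = rows[:, 2] * rows[:, 3]

# Single feature: the torque difference, shaped as one column.
diff = (lt - rt).reshape(-1, 1)

# Same numeric coding as in the notebook: B=0, L=1, R=2.
labels = np.where(lt > rt, 1, np.where(lt < rt, 2, 0))

# A tree limited to two levels of splits, fitted on the difference only.
stump = DecisionTreeClassifier(max_depth=2, random_state=0).fit(diff, labels)

print(export_text(stump, feature_names=["Diff"]))
print(stump.score(diff, labels))  # accuracy on the data the tree was fitted on
```

The printed rules show where the tree places its thresholds on `Diff`. The score here is measured on the fitting data, so it illustrates separability rather than generalization.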

In [77]:
x1['Diff']= x1['LT']- x1['RT']
x1.head()

| | LT | RT | Diff |
|---|---|---|---|
| 0 | 1 | 1 | 0 |
| 1 | 1 | 2 | -1 |
| 2 | 1 | 3 | -2 |
| 3 | 1 | 4 | -3 |
| 4 | 1 | 5 | -4 |


In [78]:
X_train, X_test, y_train, y_test =train_test_split(x1,y, stratify=y, test_size=0.3,random_state=40)

In [79]:
dt_model3= DecisionTreeClassifier(random_state=40)
dt_model3.fit(X_train, y_train)

DecisionTreeClassifier(random_state=40)

In [80]:
#Create Classification Report
y_pred3= dt_model3.predict(X_test)
print(classification_report(y_test, y_pred3, target_names=["Balanced", "Left", "Right"]))

              precision    recall  f1-score   support

    Balanced       1.00      1.00      1.00        15
        Left       1.00      1.00      1.00        87
       Right       1.00      1.00      1.00        86

    accuracy                           1.00       188
   macro avg       1.00      1.00      1.00       188
weighted avg       1.00      1.00      1.00       188



In [None]:
#Plotting the tree
from sklearn.tree import export_graphviz
export_graphviz(
dt_model3,
out_file=("model3.dot"),
feature_names=["Left Torque","Right Torque","Difference"],
class_names=["Balanced","Left","Right"],
filled=True)

#Run this to make the png
#  !dot -Tpng model3.dot -o model3.png


In [82]:
from sklearn.metrics import accuracy_score

accuracy_score(y_test,y_pred3)

1.0

## Final Conclusion

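With the torque difference as a feature, the model returns a perfect accuracy score on the test set, as desired. This is expected, because the sign of the difference alone determines the class.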

In [1]:
!pip install seaborn


Collecting seaborn
  Downloading seaborn-0.11.2-py3-none-any.whl (292 kB)
Installing collected packages: seaborn
Successfully installed seaborn-0.11.2
