# Practice Notebook: New Models: Random Forest

## Question

As a data professional working for a pharmaceutical company, you need to develop a model that predicts whether a patient will be diagnosed with diabetes.
Train a random forest model. Its accuracy on the test set should be at least 0.85.

Try `n_estimators` values from 1 to 10 and pick the one that scores best on the validation set.



In [1]:
# Import libraries
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
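
The cells below use `diabetes_df`, but the notebook never shows it being loaded. A minimal sketch of that step, assuming the data comes from a CSV file whose first column is a saved row index. The helper name `load_diabetes_csv` is hypothetical, and the inline sample reuses the five rows displayed later in this notebook in place of the real file, whose path is not given in the source:

```python
import io

import pandas as pd


def load_diabetes_csv(source):
    """Read the diabetes data from a path or file-like object.

    Assumption: the first CSV column is a saved row index, so it is
    used as the DataFrame index rather than kept as a feature.
    """
    return pd.read_csv(source, index_col=0)


# Stand-in for the real file: the five rows shown in this notebook's head() output.
sample_csv = io.StringIO(
    ",pregnancies,glucose,bloodpressure,skinthickness,insulin,bmi,"
    "diabetespedigreefunction,age,outcome\n"
    "0,6,148,72,35,0,33.6,0.627,50,1\n"
    "1,1,85,66,29,0,26.6,0.351,31,0\n"
    "2,8,183,64,0,0,23.3,0.672,32,1\n"
    "3,1,89,66,23,94,28.1,0.167,21,0\n"
    "4,0,137,40,35,168,43.1,2.288,33,1\n"
)
diabetes_sample = load_diabetes_csv(sample_csv)
print(diabetes_sample.shape)  # (5, 9)
```

With the real data, `source` would be the path to the CSV file instead of the in-memory sample.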



In [5]:
# Data exploration (assumes diabetes_df is already loaded; the notebook displays only the last expression)
diabetes_df.sample()
diabetes_df.head()
diabetes_df.tail()
diabetes_df.shape

(768, 9)

In [9]:
# Data cleaning/preparation: check for duplicates, missing values, and data types; standardise the column names
diabetes_df.duplicated().any()
diabetes_df.isnull().sum()
diabetes_df.dtypes
diabetes_df.columns = diabetes_df.columns.str.lower().str.strip()
diabetes_df.head()


,pregnancies,glucose,bloodpressure,skinthickness,insulin,bmi,diabetespedigreefunction,age,outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1
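
In a notebook only the last expression of a cell is displayed, so the duplicate and missing-value checks in the cell above produce no visible output. A minimal sketch that collects them into one printed summary; the helper name `quality_report` is hypothetical. It also counts zeros per column, on the assumption (not stated in the source) that a zero in a measurement such as `insulin` or `skinthickness` may be a placeholder for a missing value, which `isnull()` would not catch:

```python
import pandas as pd


def quality_report(df):
    """Summarise basic data-quality checks in a single dictionary."""
    return {
        "duplicate_rows": int(df.duplicated().sum()),
        "missing_values": int(df.isnull().sum().sum()),
        # Assumption: zeros in measurement columns may stand for missing data.
        "zero_counts": (df == 0).sum().to_dict(),
    }


# The five rows from the head() output above, used as a small illustration.
sample = pd.DataFrame({
    "pregnancies": [6, 1, 8, 1, 0],
    "glucose": [148, 85, 183, 89, 137],
    "bloodpressure": [72, 66, 64, 66, 40],
    "skinthickness": [35, 29, 0, 23, 35],
    "insulin": [0, 0, 0, 94, 168],
    "bmi": [33.6, 26.6, 23.3, 28.1, 43.1],
    "diabetespedigreefunction": [0.627, 0.351, 0.672, 0.167, 2.288],
    "age": [50, 31, 32, 21, 33],
    "outcome": [1, 0, 1, 0, 1],
})
report = quality_report(sample)
print(report)
```

Zeros in `pregnancies` and `outcome` are legitimate values, so the zero counts are only a prompt for a closer look, not a verdict.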


In [10]:
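# Data modeling: split the data into training and validation sets.
# Note: an integer test_size is an absolute row count, so test_size=1 holds out a single row,
# which is too small to validate on; a fraction such as 0.25 would hold out 25% of the rows.
train_df, valid_df = train_test_split(diabetes_df, test_size=1, random_state=1234)
print(train_df.shape)
print(valid_df.shape)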

(767, 9)
(1, 9)
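
The shapes above show that `test_size=1` left a single row for validation. A minimal sketch of a three-way split with fractional sizes, assuming a roughly 60/20/20 division into training, validation, and test sets. The ratios, the stratification, and the helper name `three_way_split` are illustrative choices, not taken from the exercise, and the demo data is synthetic:

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def three_way_split(df, target="outcome", seed=1234):
    """Split df into train, validation and test parts (about 60/20/20)."""
    # Hold out 20% as the final test set, keeping the class balance.
    rest_df, test_df = train_test_split(
        df, test_size=0.2, random_state=seed, stratify=df[target]
    )
    # 25% of the remaining 80% is 20% of the full data.
    train_df, valid_df = train_test_split(
        rest_df, test_size=0.25, random_state=seed, stratify=rest_df[target]
    )
    return train_df, valid_df, test_df


# Synthetic stand-in data: 100 rows with one feature and a binary outcome.
demo = pd.DataFrame({"glucose": range(100), "outcome": [0, 1] * 50})
demo_train, demo_valid, demo_test = three_way_split(demo)
print(len(demo_train), len(demo_valid), len(demo_test))
```

The validation part is used to choose hyperparameters such as `n_estimators`; the test part is touched once, at the end, to check the 0.85 requirement.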


In [11]:
# Construct features and target for both the training and validation sets
features_train = train_df.drop(columns=['outcome'])
target_train = train_df['outcome']
features_valid = valid_df.drop(columns=['outcome'])
target_valid = valid_df['outcome']

# Construct Decision Tree, Random Forest, and Logistic Regression models.
# Note: every score() call below uses the training data, so the printed values are training accuracies.
# a) Decision tree
for d in range(1, 11):
    tree_model = DecisionTreeClassifier(random_state=1234, max_depth=d)
    tree_model.fit(features_train, target_train)  # train the model
    # check accuracy on the training data
    print(f'Decision tree has accuracy of: {tree_model.score(features_train, target_train)} for depth of: {d}')

# b) Random forest (n_estimators from 1 to 10, as the question asks)
for n in range(1, 11):
    forest_model = RandomForestClassifier(random_state=1234, n_estimators=n)
    forest_model.fit(features_train, target_train)
    print(f'Random forest has accuracy of: {forest_model.score(features_train, target_train)} for n={n}')

# c) Logistic regression
log_model = LogisticRegression(random_state=1234, solver='liblinear')
log_model.fit(features_train, target_train)
print(f'logistic regression has accuracy of: {log_model.score(features_train, target_train)}')




Decision tree has accuracy of: 0.7353324641460235 for depth of: 1
Decision tree has accuracy of: 0.771838331160365 for depth of: 2
Decision tree has accuracy of: 0.7757496740547588 for depth of: 3
Decision tree has accuracy of: 0.7913950456323338 for depth of: 4
Decision tree has accuracy of: 0.8370273794002607 for depth of: 5
Decision tree has accuracy of: 0.8513689700130378 for depth of: 6
Decision tree has accuracy of: 0.8917861799217731 for depth of: 7
Decision tree has accuracy of: 0.9282920469361148 for depth of: 8
Decision tree has accuracy of: 0.9569752281616688 for depth of: 9
Decision tree has accuracy of: 0.970013037809648 for depth of: 10
Random forest has accuracy of: 0.8917861799217731 for n=1
Random forest has accuracy of: 0.9074315514993481 for n=2
Random forest has accuracy of: 0.9595827900912647 for n=3
Random forest has accuracy of: 0.9556714471968709 for n=4
Random forest has accuracy of: 0.9739243807040417 for n=5
Random forest has accuracy of: 0.9687092568448501 for n=6
[output truncated]
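
The loops above score each model on the data it was trained on. The exercise asks for the `n_estimators` value that does best on the validation set; a minimal sketch of that selection follows. The helper name `pick_n_estimators` is hypothetical, and the data is a synthetic stand-in from `make_classification`, so the printed score says nothing about the diabetes data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split


def pick_n_estimators(X_train, y_train, X_valid, y_valid, candidates, seed=1234):
    """Return the candidate with the best validation accuracy and all scores."""
    scores = {}
    for n in candidates:
        model = RandomForestClassifier(n_estimators=n, random_state=seed)
        model.fit(X_train, y_train)
        # score() on held-out data gives validation accuracy.
        scores[n] = model.score(X_valid, y_valid)
    # max() returns the first best key, so ties go to the smaller forest.
    best_n = max(scores, key=scores.get)
    return best_n, scores


# Synthetic stand-in data (assumption: 8 features, binary target).
X, y = make_classification(n_samples=400, n_features=8, random_state=1234)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=1234, stratify=y
)
best_n, scores = pick_n_estimators(X_train, y_train, X_valid, y_valid, range(1, 11))
print(f"best n_estimators={best_n}, validation accuracy={scores[best_n]:.3f}")
```

On the notebook's data the call would take `features_train`, `target_train`, `features_valid`, and `target_valid`, provided the validation set holds more than one row.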

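### Conclusion
*   In this run, logistic regression does not meet the criterion of at least 0.85 accuracy (its score is not shown in the output above). Random forest exceeds 0.85 for every `n_estimators` value shown, starting at n=1, whereas the decision tree reaches 0.85 only at a `max_depth` of 6 or more.
*   Every score above is measured on the training data, so it shows how well each model fits rows it has already seen and is likely to overstate the accuracy on new patients.
*   On these scores, the decision tree is the recommended model: it meets the criterion and is faster than the random forest, which also meets the criterion. The recommendation should be confirmed on a held-out validation or test set before it is relied on.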
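
The conclusion rests on two claims: that the accuracies carry over to new data, and that the decision tree is faster. A minimal sketch of how both could be checked, reporting training accuracy, validation accuracy, and fit time per model. The helper name `compare_models` is hypothetical, the hyperparameters (`max_depth=6`, `n_estimators=10`) are picked for illustration, and the data is synthetic, so the numbers do not describe the diabetes dataset:

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def compare_models(models, X_train, y_train, X_valid, y_valid):
    """Fit each model and record train/validation accuracy and fit time."""
    results = {}
    for name, model in models.items():
        start = time.perf_counter()
        model.fit(X_train, y_train)
        elapsed = time.perf_counter() - start
        results[name] = {
            "train_accuracy": model.score(X_train, y_train),
            "valid_accuracy": model.score(X_valid, y_valid),
            "fit_seconds": elapsed,
        }
    return results


# Synthetic stand-in data (assumption: 8 features, binary target).
X, y = make_classification(n_samples=400, n_features=8, random_state=1234)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=1234, stratify=y
)
models = {
    "decision_tree": DecisionTreeClassifier(max_depth=6, random_state=1234),
    "random_forest": RandomForestClassifier(n_estimators=10, random_state=1234),
}
results = compare_models(models, X_train, y_train, X_valid, y_valid)
for name, row in results.items():
    print(name, row)
```

A large gap between `train_accuracy` and `valid_accuracy` suggests overfitting, in which case the training score alone is not evidence that the 0.85 requirement is met.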





