## Random Forest
Random Forest relies on the wisdom of the crowd. It uses Ensemble Learning: the outputs of many models are aggregated into a single prediction.
One more good practice, for best results: build the ensemble from different models (e.g. Logistic Regression, Naive Bayes, Decision Tree Classifier).
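
As a minimal sketch of combining different models, scikit-learn's `VotingClassifier` can wrap the three learners named above and take a majority vote. The synthetic data from `make_classification`, the variable names, and all parameter values are assumptions chosen for illustration; they are not part of the original notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: any tabular classification set would do)
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

# Three different model families form the "crowd"
voting_clf = VotingClassifier(
    estimators=[
        ("logreg", LogisticRegression(max_iter=1000)),
        ("nb", GaussianNB()),
        ("tree", DecisionTreeClassifier(max_depth=3, random_state=0)),
    ],
    voting="hard",  # each model casts one vote; the majority class wins
)
voting_clf.fit(X_demo, y_demo)
voting_preds = voting_clf.predict(X_demo)
print(voting_preds[:10])
```

Hard voting is used here because it mirrors the "aggregate the crowd" idea most directly; `voting="soft"` would average predicted probabilities instead.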

**However, if you don't use different models, at least train the same base learner on different data sets.
This is called Bagging: Bootstrapping (sampling the training data with replacement) and Aggregating (combining the results of the base learners).**
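
The two steps can be sketched in plain Python. This is a toy illustration under stated assumptions: the 1-D data, the threshold "stump" base learner, and every function name (`bootstrap_sample`, `fit_stump`, `bagging_predict`) are hypothetical and only serve to show bootstrapping followed by aggregating.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement (the bootstrapping step)."""
    return [rng.choice(rows) for _ in rows]

def fit_stump(rows):
    """Toy base learner: threshold at the midpoint between the two class means."""
    zeros = [x for x, label in rows if label == 0]
    ones = [x for x, label in rows if label == 1]
    if not zeros or not ones:
        # Degenerate sample containing a single class: always predict it
        only = 0 if zeros else 1
        return lambda x: only
    mean0 = sum(zeros) / len(zeros)
    mean1 = sum(ones) / len(ones)
    threshold = (mean0 + mean1) / 2
    high = 1 if mean1 > mean0 else 0  # class that lies above the threshold
    return lambda x: high if x > threshold else 1 - high

def bagging_predict(models, x):
    """Combine the base learners by majority vote (the aggregating step)."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]

rng = random.Random(0)
# Toy 1-D data: class 0 clusters near 1.0, class 1 near 3.0
data = [(rng.gauss(1.0, 0.5), 0) for _ in range(50)]
data += [(rng.gauss(3.0, 0.5), 1) for _ in range(50)]

# Train the same base learner on 25 different bootstrap samples
models = [fit_stump(bootstrap_sample(data, rng)) for _ in range(25)]

print(bagging_predict(models, 0.5), bagging_predict(models, 3.5))
```

An odd number of base learners is used so that a two-class majority vote cannot tie.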

Why use it?

**1. Reduces variance**: A standalone model can have high variance. Aggregating the base models’ predictions in an ensemble helps reduce it.

**2. Fast**: Training can happen in parallel across CPU cores and even across different servers.

**3. Good for big data**: Bagging doesn’t require the entire training dataset to be held in memory during model training. You can set the sample size for each bootstrap to a fraction of the overall data, train a base learner on it, and combine these base learners without ever reading in the entire dataset at once.
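
The points above map onto parameters of scikit-learn's `BaggingClassifier`, sketched here on synthetic data. The data set and all values (`n_estimators=20`, `max_samples=0.2`) are illustrative assumptions. This in-memory sketch only shows the per-learner sample size and parallel training; it does not demonstrate out-of-core loading.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption)
X_bag, y_bag = make_classification(n_samples=1000, n_features=8, random_state=0)

bag_clf = BaggingClassifier(
    DecisionTreeClassifier(),  # base learner, passed positionally
    n_estimators=20,           # number of base learners to aggregate
    max_samples=0.2,           # each learner sees a bootstrap of 20% of the rows
    bootstrap=True,            # sample with replacement
    n_jobs=-1,                 # train the learners in parallel on all CPU cores
    random_state=0,
)
bag_clf.fit(X_bag, y_bag)

# Number of rows drawn for the first base learner
print(len(bag_clf.estimators_samples_[0]))
```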

In [4]:
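def EDA_RandomForest(rf_table_1):
    # Drop columns if necessary
    rf_table = rf_table_1.drop(['Cabin'], axis=1)
    # Drop every row that still contains a missing value
    rf_table = rf_table.dropna()
    # Confirm that no missing values remain (count of NaNs per column)
    print("Random Forest EDA: missing values per column = \n", rf_table.isna().sum())
    print("Random Forest : EDA - Complete \n")
    return rf_table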

In [5]:
def FeatureEngineering_RandomForest(df_rf_fe):
    import pandas as pd

    # Feature selection: drop, multiply, or create new features/predictors if needed
    print("Random Forest : Feature Engineering \n")

    print("Selected predictors for transformation\n")
    # Work on a copy so the caller's DataFrame is not modified
    df_rf_fe = df_rf_fe.copy()
    # Feature transformation
    # Convert Pclass to string so it is treated as a categorical variable
    df_rf_fe['Pclass'] = df_rf_fe['Pclass'].astype('str')
    # Dummy encode the categorical columns
    df_rf_fe = pd.get_dummies(df_rf_fe, drop_first=True)
    print("Check if categorical variables are in string or dummy encoded \n", df_rf_fe.columns, "\n")

    # Feature selection, based on the correlation matrix:
    # Fare has a significant positive correlation with survival
    # Sex_male and Pclass have a significant negative correlation with survival
    # Fare and Sex_male are therefore the strongest candidates; all encoded
    # columns are kept for now (uncomment the line below to drop one)

    # df_rf_fe_selected = df_rf_fe.drop(['Embarked_S'], axis=1)
    df_rf_fe_selected = df_rf_fe
    print("Selected predictors for modelling \n")
    # info() prints its own summary and returns None, so call it directly
    df_rf_fe_selected.info()
    print()
    return df_rf_fe_selected


**Most Important Hyperparameters for Random Forest**

1. max_depth       : Specifies how many levels each tree can have, and ultimately determines how many splits it can make.
2. min_samples_leaf: Defines the minimum number of samples for a leaf node, i.e. a split can only occur if it guarantees a minimum number of observations in each resulting node.
3. max_features    : Controls the randomness. It specifies the number of features that are randomly considered when looking for the best split at each node.
4. n_estimators    : Specifies the number of trees your model will build in its ensemble.
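
A short sketch of how these four hyperparameters are passed to scikit-learn's `RandomForestClassifier`. The synthetic data and the specific values are assumptions for illustration, not tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption)
X_hp, y_hp = make_classification(n_samples=300, n_features=6, random_state=0)

rf_demo = RandomForestClassifier(
    max_depth=3,         # at most 3 levels of splits per tree
    min_samples_leaf=5,  # every leaf must keep at least 5 samples
    max_features=2,      # 2 randomly chosen features are considered per split
    n_estimators=25,     # the forest contains 25 trees
    random_state=0,
)
rf_demo.fit(X_hp, y_hp)

# Check the depth limit against the fitted trees
deepest = max(tree.get_depth() for tree in rf_demo.estimators_)
print(len(rf_demo.estimators_), deepest)
```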

In [3]:
def HyperParameterTuning_RandomForest():
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV

    rf_cv = RandomForestClassifier(random_state=0)
    print('Random Forest : Hyperparameter Tuning ')

    # Candidate values per hyperparameter; add more values to a list to widen the search
    param_cv_rf = { 'max_depth'         :[2],
                    'min_samples_leaf'  :[5],
                    'min_samples_split' :[0.3],  # a float is read as a fraction of the samples
                    'max_features'      :[2,3],
                    'n_estimators'      :[10]
                  }

    # Metrics computed for every parameter combination
    scoring = ('accuracy', 'precision', 'recall', 'f1')

    # With several metrics, refit names the one used to select and refit the best model
    rf_result = GridSearchCV(rf_cv, param_cv_rf, scoring=scoring, cv=5, refit='f1')
    # The search is returned unfitted; call .fit(X_train, y_train) on it
    return rf_result

In [10]:
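def Split_data(df_split_1):
    from sklearn.model_selection import train_test_split
    # Separate the predictors (X) from the target column (y)
    X = df_split_1.drop(['Survived'], axis=1)
    y = df_split_1['Survived']
    # Hold out 25% for testing; stratify keeps the class ratio the same in both splits
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, stratify=y, random_state=42)

    return X_train, X_test, y_train, y_test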
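
The splitting and tuning steps above could be combined roughly as follows. This is a self-contained sketch that uses synthetic data in place of the notebook's DataFrame (an assumption), with the same split settings and a grid close to the one defined earlier. It shows how the multi-metric results are read after fitting.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the prepared features and target (assumption)
X_all, y_all = make_classification(n_samples=400, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X_all, y_all, test_size=0.25, stratify=y_all, random_state=42
)

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    {"max_depth": [2], "min_samples_leaf": [5], "max_features": [2, 3], "n_estimators": [10]},
    scoring=("accuracy", "precision", "recall", "f1"),
    cv=5,
    refit="f1",  # the best model is chosen by mean cross-validated F1
)
search.fit(X_train, y_train)

# Winning parameter combination and the mean F1 of every candidate
print(search.best_params_)
print(search.cv_results_["mean_test_f1"])

# Accuracy of the refitted best model on the held-out test set
test_accuracy = search.best_estimator_.score(X_test, y_test)
print(test_accuracy)
```

Each metric in `scoring` gets its own `mean_test_<name>` entry in `cv_results_`, so accuracy, precision, and recall can be inspected the same way as F1.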
