# Dataset | Problem

The 2012 US Army Anthropometric Survey (ANSUR II) was executed by the Natick Soldier Research, Development and Engineering Center (NSRDEC) from October 2010 to April 2012 and is comprised of personnel representing the total US Army force to include the US Army Active Duty, Reserves, and National Guard. In addition to the anthropometric and demographic data described below, the ANSUR II database also consists of 3D whole body, foot, and head scans of Soldier participants. These 3D data are not publicly available out of respect for the privacy of ANSUR II participants. The data from this survey are used for a wide range of equipment design, sizing, and tariffing applications within the military and has many potential commercial, industrial, and academic applications.

The ANSUR II working databases contain 93 anthropometric measurements which were directly measured, and 15 demographic/administrative variables explained below. The ANSUR II Male working database contains a total sample of 4,082 subjects. The ANSUR II Female working database contains a total sample of 1,986 subjects.


data dict:
https://data.world/datamil/ansur-ii-data-dictionary/workspace/file?filename=ANSUR+II+Databases+Overview.pdf

Hİnt for metric : Our mission to classify soldiers races via their body sclales. We want a balanced score for our predictions.

# Ingest the data from links below and make a dataframe
- Soldiers Male : https://query.data.world/s/h3pbhckz5ck4rc7qmt2wlknlnn7esr
- Soldiers Female : https://query.data.world/s/sq27zz4hawg32yfxksqwijxmpwmynq

# EDA
Tips :
- Drop unnecessary colums
- Drop DODRace class if value count below 500 (we assume that our data model can't learn if it is below 500)
- Find unusual value in Weightlbs

# Context:

### SUBJECT: 2012 US Army Anthropometric Working Databases

#### Background
    1. This memorandum outlines the contents of the ANSUR II Working Databases and provides a
    brief explanation of each variable contained in the databases. These databases and this
    memorandum have been reviewed and cleared for UNLIMITED PUBLIC RELEASE.
    2. The 2012 US Army Anthropometric Survey (ANSUR II) was executed by the Natick Soldier
    Research, Development and Engineering Center (NSRDEC) from October 2010 to April 2012
    and is comprised of personnel representing the total US Army force to include the US Army
    Active Duty, Reserves, and National Guard. In addition to the anthropometric and demographic
    data described below, the ANSUR II database also consists of 3D whole body, foot, and head
    scans of Soldier participants. These 3D data are not publicly available out of respect for the
    privacy of ANSUR II participants. The data from this survey are used for a wide range of
    equipment design, sizing, and tariffing applications within the military and has many potential
    commercial, industrial, and academic applications.
    3. The ANSUR II working databases contain 93 anthropometric measurements which were
    directly measured, and 15 demographic/administrative variables explained below. The
    ANSUR II Male working database contains a total sample of 4,082 subjects. The ANSUR II
    Female working database contains a total sample of 1,986 subjects. The databases are reported in
    the associated spreadsheet files:
        a. “ANSUR II MALE Public.csv”
        b. “ANSUR II FEMALE Public.csv”. 

#### Data Content
    4. Demographic/Administrative Data: The following variables are included in the ANSUR II
    working databases for each subject and were assigned to or collected from subjects at the time of
    their participation.
         subjectid – A unique number for each participant measured in the anthropometric survey,
        ranging from 10027 to 920103, not inclusive
         SubjectBirthLocation – Subject Birth Location; a U.S. state or foreign country
         SubjectNumericRace – Subject Numeric Race; a single or multi-digit code
        indicating a subject’s self-reported race or races (verified through interview).
        Where 
                1 = White, 
                2 = Black, 
                3 = Hispanic, 
                4 = Asian, 
                5 = Native American,
                6 = Pacific Islander, 
                8 = Other
         Ethnicity – self-reported ethnicity (verified through interview); e.g. “Mexican”,
        “Vietnamese”
         DODRace – Department of Defense Race; a single digit indicating a subject’s
        self-reported preferred single race where selecting multiple races is not an option.
        This variable is intended to be comparable to the Defense Manpower Data Center
        demographic data. 
        Where 
                1 = White, 
                2 = Black, 
                3 = Hispanic, 
                4 = Asian,
                5 = Native American, 
                6 = Pacific Islander, 
                8 = Other
         Gender – “Male” or “Female”
         Age – Participant’s age in years
         Heightin – Height in Inches; self-reported, comparable to measured “stature”
         Weightlbs – Weight in Pounds; self-reported, comparable to measured “weightkg”
         WritingPreference – Writing Preference; “Right hand”, “Left hand”, or
        “Either hand (No preference)”
         Date – Date the participant was measured, ranging from “04-Oct-10” to “05-Apr-12”
         Installation – U.S. Army installation where the measurement occurred;
        e.g. “Fort Hood”, “Camp Shelby”
         Component – “Army National Guard”, “Army Reserve”, or “Regular Army”
         Branch – “Combat Arms”, “Combat Support”, or “Combat Service Support”
         PrimaryMOS – Primary Military Occupational Specialty
    5. Anthropometric Data: the following variables are included in the ANSUR II working
    databases for each subject and were directly-measured dimensions of the participant’s body. All
    measurements are recorded in millimeters with the exception of the variable “weightkg”.
         abdominalextensiondepthsitting – Abdominal Extension Depth, Sitting
         acromialheight – Acromial Height
         acromionradialelength – Acromion-Radiale Length
         anklecircumference – Ankle Circumference
         axillaheight – Axilla Height
         balloffootcircumference – Ball of Foot Circumference
         balloffootlength – Ball of Foot Length
         biacromialbreadth – Biacromial Breadth
         bicepscircumferenceflexed – Biceps Circumference, Flexed
         bicristalbreadth – Bicristal Breadth
         bideltoidbreadth – Bideltoid Breadth
         bimalleolarbreadth – Bimalleolar Breadth
         bitragionchinarc – Bitragion Chin Arc
         bitragionsubmandibulararc – Bitragion Submandibular Arc
         bizygomaticbreadth – Bizygomatic Breadth
         buttockcircumference – Buttock Circumference
         buttockdepth – Buttock Depth
         buttockheight – Buttock Height
         buttockkneelength – Buttock-Knee Length
         buttockpopliteallength – Buttock-Popliteal Length
         calfcircumference – Calf Circumference
         cervicaleheight – Cervical Height
         chestbreadth – Chest Breadth
         chestcircumference – Chest Circumference
         chestdepth – Chest Depth
         chestheight – Chest Height
         crotchheight – Crotch Height
         crotchlengthomphalion – Crotch Length (Omphalion)
         crotchlengthposterioromphalion – Crotch Length, Posterior (Omphalion)
         earbreadth – Ear Breadth
         earlength – Ear Length
         earprotrusion – Ear Protrusion
         elbowrestheight – Elbow Rest Height
         eyeheightsitting – Eye Height, Sitting 
         footbreadthhorizontal – Foot Breadth, Horizontal
         footlength – Foot Length
         forearmcenterofgriplength – Forearm-Center of Grip Length
         forearmcircumferenceflexed – Forearm Circumference, Flexed
         forearmforearmbreadth – Forearm-Forearm Breadth
         forearmhandlength – Forearm -Hand Length
         functionalleglength – Functional Leg Length
         handbreadth – Hand Breadth
         handcircumference – Hand Circumference
         handlength – Hand Length
         headbreadth – Head Breadth
         headcircumference – Head Circumference
         headlength – Head Length
         heelanklecircumference – Heel-Ankle Circumference
         heelbreadth – Heel Breadth
         hipbreadth – Hip Breadth
         hipbreadthsitting – Hip Breadth, Sitting
         iliocristaleheight – Iliocristale Height
         interpupillarybreadth – Interpupillary Breadth
         interscyei – Interscye I
         interscyeii – Interscye II
         kneeheightmidpatella – Knee Height, Midpatella
         kneeheightsitting – Knee Height, Sitting
         lateralfemoralepicondyleheight – Lateral Femoral Epicondyle Height
         lateralmalleolusheight – Lateral Malleolus Height
         lowerthighcircumference – Lower Thigh Circumference
         mentonsellionlength – Menton-Sellion Length
         neckcircumference – Neck Circumference
         neckcircumferencebase – Neck Circumference, Base
         overheadfingertipreachsitting – Overhead Fingertip Reach, Sitting
         palmlength – Palm Length
         poplitealheight – Popliteal Height
         radialestylionlength – Radiale-Stylion Length
         shouldercircumference – Shoulder Circumference
         shoulderelbowlength – Shoulder-Elbow Length
         shoulderlength – Shoulder Length
         sittingheight – Sitting Height
         sleevelengthspinewrist – Sleeve Length: Spine-Wrist
         sleeveoutseam – Sleeve Outseam
         span - Span
         stature - Stature
         suprasternaleheight – Suprasternale Height
         tenthribheight – Tenth Rib Height
         thighcircumference – Thigh Circumference
         thighclearance – Thigh Clearance
         thumbtipreach – Thumbtip Reach
         tibialheight – Tibiale Height
         tragiontopofhead – Tragion-Top of Head
         trochanterionheight – Trochanterion Height
         verticaltrunkcircumferenceusa – Vertical Trunk Circumference (USA)
         waistbacklength – Waist Back Length (Omphalion)
         waistbreadth – Waist Breadth
         waistcircumference – Waist Circumference (Omphalion)
         waistdepth – Waist Depth
         waistfrontlengthsitting – Waist Front Length, Sitting
         waistheightomphalion – Waist Height (Omphalion)
         weightkg – Weight (in kg*10)
         wristcircumference – Wrist Circumference
         wristheight – Wrist Height
        
#### Recommendations:
    6. The ANSUR II working databases are a representative sample of the US Army at the time of
    data collection and may or may not be representative of other populations of interest, to include
    later instances of the US Army. Other US military services maintain anthropometric databases of
    their service members which are distinct from the US Army’s anthropometric databases
    (ANSUR II). The US Army also maintains separate anthropometric databases representing Male
    and Female US Army pilots which are distinct from ANSUR II.
    7. The ANSUR II working databases are presented as two separate databases – one Female, one
    Male. In almost all cases, these databases should be treated and analyzed separately.
    Combination of the databases will result in a sample that is not representative of any real
    population and could easily lead to erroneous conclusions.
    8. Much more information about the data collection methodology and content of the ANSUR II
    Working Databases may be found in the following Technical Reports, available from the
    Defense Technical Information Center (www.dtic.mil) through the hyperlinks provided:
        a. 2010-2012 Anthropometric Survey of U.S. Army Personnel: Methods and Summary
        Statistics. (NATICK/TR-15/007)
        b. Measurer’s Handbook: US Army and Marine Corps Anthropometric Surveys,
    2010-2011 (NATICK/TR-11/017)
    9. The primary POC for the ANSUR II working databases is Joseph L Parham, Research
    Anthropologist, Email: joseph.l.parham2.civ@mail.mil.
    Steven P Paquette Joseph L Parham
    Anthropometry Team Leader Research Anthropologist
    Natick RD&E Center, Natick, MA Natick RD&E Center, Natick, MA

# Import Libraries

In [1]:
#pip install pyforest

In [2]:
# 1-Import Libraies

#!pip install lightgbm
#!pip install catboost

import numpy as np
import pandas as pd 
import seaborn as sns
import matplotlib.pyplot as plt
import scipy.stats as stats
%matplotlib inline
%matplotlib notebook
import statsmodels.api as sm
import statsmodels.formula.api as smf

from sklearn.pipeline import make_pipeline
from sklearn.compose import make_column_transformer

#Model Selection
from sklearn import model_selection
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.model_selection import KFold, cross_val_predict

#Feature Selection
from sklearn.feature_selection import SelectKBest, SelectPercentile, f_classif, f_regression, mutual_info_regression

#Models
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso, Ridge, ElasticNet
from sklearn.linear_model import LogisticRegression

from sklearn import neighbors
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neighbors import KNeighborsRegressor

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB

from sklearn.svm import SVC
from sklearn.svm import SVR

from sklearn import tree
from sklearn.tree import DecisionTreeRegressor
from sklearn.tree import DecisionTreeClassifier

from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble import RandomForestClassifier 
from sklearn.ensemble import ExtraTreesRegressor

from xgboost import XGBClassifier
from xgboost import plot_importance

from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier

from sklearn.neural_network import MLPRegressor

#Scaling
from sklearn.preprocessing import scale 
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import PolynomialFeatures 
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import PowerTransformer 
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import RobustScaler


#Metrics
from sklearn import metrics
from sklearn.metrics import roc_auc_score, auc, roc_curve, precision_recall_curve
from sklearn.metrics import accuracy_score, recall_score, average_precision_score
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import plot_confusion_matrix
from sklearn.metrics import plot_roc_curve, plot_precision_recall_curve
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score 


#Importing plotly and cufflinks in offline mode
import cufflinks as cf
import plotly.offline
cf.go_offline()
cf.set_config_file(offline=False, world_readable=True)

#Ignore Warnings
import warnings
warnings.filterwarnings("ignore")
warnings.warn("this will not show")

#Figure&Display options
plt.rcParams["figure.figsize"] = (10,6)
pd.set_option('max_colwidth',200)
pd.set_option('display.max_rows', 1000)
pd.set_option('display.max_columns', 200)
pd.set_option('display.float_format', lambda x: '%.3f' % x)

## Useful Functions

In [3]:
## Useful Functions

###############################################################################

def first_looking(column):
    print("column name    : ", column) 
    print("--------------------------------")
    print("per_of_nulls   : ", "%", round(df[column].isnull().sum()/df.shape[0]*100, 2))
    print("num_of_nulls   : ", df[column].isnull().sum())
    print("num_of_uniques : ", df[column].nunique())
    print("value_counts : ", df[column].value_counts(dropna = False).head())
    
# for i in df.columns:
#     first_looking(i)

###############################################################################

def missing (df):
    missing_number = df.isnull().sum().sort_values(ascending=False)
    missing_percent = (df.isnull().sum()/df.isnull().count()).sort_values(ascending=False)
    missing_values = pd.concat([missing_number, missing_percent], axis=1, keys=['Missing_Number', 'Missing_Percent'])
    return missing_values

###############################################################################

def perc_nans(serial):  # Ex:perc_nans(df['kW'])
    # display percentage of nans in a Series
    return serial.isnull().sum()/serial.shape[0]*100

def perc_nans_byLimitless(df):
    return df.isnull().sum()/df.shape[0]*100

def perc_nans_byLimit(df, limit):
    missing = df.isnull().sum()*100/df.shape[0]
    return missing.loc[lambda x : x >= limit]

# perc_nans_byLimit(df, 90)

###############################################################################

def fill_median(df, group_col, col_name):
    '''Fills the missing values with the most existing value (median) in the relevant column according to single-stage grouping'''
    for group in list(df[group_col].unique()):
        cond = df[group_col]==group
        median = list(df[cond][col_name].median())
        if median != []:
            df.loc[cond, col_name] = df.loc[cond, col_name].fillna(df[cond][col_name].median()[0])
        else:
            df.loc[cond, col_name] = df.loc[cond, col_name].fillna(df[col_name].median()[0])
    print("Number of NaN : ",df[col_name].isnull().sum())
    print("------------------")
    print(df[col_name].value_counts(dropna=False))
    
###############################################################################

def fill_most(df, group_col, col_name):
    '''Fills the missing values with the most existing value (mode) in the relevant column according to single-stage grouping'''
    for group in list(df[group_col].unique()):
        cond = df[group_col]==group
        mode = list(df[cond][col_name].mode())
        if mode != []:
            df.loc[cond, col_name] = df.loc[cond, col_name].fillna(df[cond][col_name].mode()[0])
        else:
            df.loc[cond, col_name] = df.loc[cond, col_name].fillna(df[col_name].mode()[0])
    print("Number of NaN : ",df[col_name].isnull().sum())
    print("------------------")
    print(df[col_name].value_counts(dropna=False))
    
###############################################################################

def fill_prop(df, group_col, col_name):
    for group in list(df[group_col].unique()):
        cond = df[group_col]==group
        df.loc[cond, col_name] = df.loc[cond, col_name].fillna(method="ffill").fillna(method="bfill")
    df[col_name] = df[col_name].fillna(method="ffill").fillna(method="bfill")
    print("Number of NaN : ",df[col_name].isnull().sum())
    print("------------------")
    print(df[col_name].value_counts(dropna=False))
    
###############################################################################

def fill(df, group_col1, group_col2, col_name, method): # method can be "mode" or "median" or "ffill"
    if method == "mode":
        for group1 in list(df[group_col1].unique()):
            for group2 in list(df[group_col2].unique()):
                cond1 = df[group_col1]==group1
                cond2 = (df[group_col1]==group1) & (df[group_col2]==group2)
                mode1 = list(df[cond1][col_name].mode())
                mode2 = list(df[cond2][col_name].mode())
                if mode2 != []:
                    df.loc[cond2, col_name] = df.loc[cond2, col_name].fillna(df[cond2][col_name].mode()[0])
                elif mode1 != []:
                    df.loc[cond2, col_name] = df.loc[cond2, col_name].fillna(df[cond1][col_name].mode()[0])
                else:
                    df.loc[cond2, col_name] = df.loc[cond2, col_name].fillna(df[col_name].mode()[0])
                
    elif method == "median":
        for group1 in list(df[group_col1].unique()):
            for group2 in list(df[group_col2].unique()):
                cond1 = df[group_col1]==group1
                cond2 = (df[group_col1]==group1) & (df[group_col2]==group2)
                df.loc[cond2, col_name] = df.loc[cond2, col_name].fillna(df[cond2][col_name].median()).fillna(df[cond1][col_name].median()).fillna(df[col_name].median())
                
    elif method == "ffill":           
        for group1 in list(df[group_col1].unique()):
            for group2 in list(df[group_col2].unique()):
                cond2 = (df[group_col1]==group1) & (df[group_col2]==group2)
                df.loc[cond2, col_name] = df.loc[cond2, col_name].fillna(method="ffill").fillna(method="bfill")
                
        for group1 in list(df[group_col1].unique()):
            cond1 = df[group_col1]==group1
            df.loc[cond1, col_name] = df.loc[cond1, col_name].fillna(method="ffill").fillna(method="bfill")            
           
        df[col_name] = df[col_name].fillna(method="ffill").fillna(method="bfill")
    
    print("Number of NaN : ",df[col_name].isnull().sum())
    print("------------------")
    print(df[col_name].value_counts(dropna=False))
    
###############################################################################

def model_validation(y_train, y_train_pred, y_test, y_test_pred, model_name):
    
    scores =  {f"{model_name}_train": {"R2" : r2_score(y_train, y_train_pred),
    "rmse" : np.sqrt(mean_squared_error(y_train, y_train_pred)),
    "mse" : mean_squared_error(y_train, y_train_pred), 
    "mae" : mean_absolute_error(y_train, y_train_pred)},
    
    f"{model_name}_test": {"R2" : r2_score(y_test, y_test_pred),
    "rmse" : np.sqrt(mean_squared_error(y_test, y_test_pred)),
    "mse" : mean_squared_error(y_test, y_test_pred),
    "mae" : mean_absolute_error(y_test, y_test_pred)}}
     
    return pd.DataFrame(scores)

# lm = model_validation(y_train, y_train_pred, y_test, y_test_pred, 'lm')

# pd.concat([lm, rs, rcvs, lss, lcvs, es, ecvs], axis = 1)

###############################################################################

def get_classification_report(y_test, y_test_pred):
    from sklearn import metrics
    report = metrics.classification_report(y_test, y_test_pred, output_dict=True)
    df_classification_report = pd.DataFrame(report).transpose()
    #df_classification_report = df_classification_report.sort_values(by=['f1-score'], ascending=False)
    return df_classification_report

###############################################################################

def shape_control():
    print('df.shape:', df.shape)
    print('X.shape:', X.shape)
    print('y.shape:', y.shape)
    print('X_train.shape:', X_train.shape)
    print('y_train.shape:', y_train.shape)
    print('X_test.shape:', X_test.shape)
    print('y_test.shape:', y_test.shape)
    try:
        print('y_test_pred.shape:', y_test_pred.shape)
    except:
        print()
        
###############################################################################

def calc_predict():
    return accuracy_score(y_test, y_test_pred), recall_score(y_test, y_test_pred)
    
def get_report():
    from sklearn import metrics
    pd.set_option('display.float_format', lambda x: '%.3f' % x)
    y_train_pred = model.predict(X_train_scaled)
    try:
        y_train_pred_proba = model.predict_proba(X_train_scaled)
    except:
        print()
    try:
        precision, recall, _ = precision_recall_curve(y_train, y_train_pred_proba[:,1])
    except:
        print() 
    try:
        y_test_pred_proba = model.predict_proba(X_test_scaled)
    except:
        print()
    try:
        precision, recall, _ = precision_recall_curve(y_test, y_test_pred_proba[:,1])
    except:
        print()  
    print('Model:', model.get_params, '\n')
    try:
        print('model.best_params_:', model.best_params_, '\n')
    except:
        print()
    print("Train:")
    print('rmse:', np.sqrt(mean_squared_error(y_train, y_train_pred)))
    print('accuracy:', accuracy_score(y_train, y_train_pred))
    try:
        print('roc_auc_score:',roc_auc_score(y_train, y_train_pred_proba[:,1]))
    except:
        print()
    try:
        print('roc_auc_recall_precision_score:',auc(recall, precision),'\n')
    except:
        print()
    print('confusion_matrix:\n\n', confusion_matrix(y_train, y_train_pred), '\n')
    print('classification_report:\n\n', classification_report(y_train, y_train_pred),'\n')
    print()
    print("Test:")
    print('rmse:', np.sqrt(mean_squared_error(y_test, y_test_pred))) 
    print('accuracy:', accuracy_score(y_test, y_test_pred))
    try:
        print('roc_auc_score:',roc_auc_score(y_test, y_test_pred_proba[:,1]))
    except:
        print() 
    try:
        print('roc_auc_recall_precision_score:',auc(recall, precision),'\n')
    except:
        print() 
    print('confusion_matrix:\n\n', confusion_matrix(y_test, y_test_pred), '\n')
    print('classification_report:\n\n', classification_report(y_test, y_test_pred))

def train_control_table():
    y_train_pred = model.predict(X_train_scaled)
    y_train_pred = pd.DataFrame(y_train_pred)
    y_train_pred.rename(columns = {0: 'y_train_pred'}, inplace = True)
    return pd.concat([X_train, y_train, y_train_pred.set_index(y_train.index)], axis=1)

def test_control_table():
    y_test_pred = model.predict(X_test_scaled)
    y_test_pred = pd.DataFrame(y_test_pred)
    y_test_pred.rename(columns = {0: 'y_test_pred'}, inplace = True)
    return pd.concat([X_test, y_test, y_test_pred.set_index(y_test.index)], axis=1)

###############################################################################

def feature_importances():
    df_fi = pd.DataFrame(index=X.columns, 
                         data=model.feature_importances_, 
                         columns=["Feature Importance"]).sort_values("Feature Importance")

    return df_fi.sort_values(by="Feature Importance", ascending=False).T

def feature_importances_bar():
    df_fi = pd.DataFrame(index=X.columns, 
                         data=model.feature_importances_, 
                         columns=["Feature Importance"]).sort_values("Feature Importance")
    sns.barplot(data = df_fi, 
                x = df_fi.index, 
                y = 'Feature Importance', 
                order=df_fi.sort_values('Feature Importance', ascending=False).reset_index()['index'])
    plt.xticks(rotation = 90)
    plt.tight_layout()
    plt.show();

In [4]:
def outlier_zscore(df, col, min_z=1, max_z = 5, step = 0.1, print_list = False):
    z_scores = stats.zscore(df[col].dropna())
    threshold_list = []
    for threshold in np.arange(min_z, max_z, step):
        threshold_list.append((threshold, len(np.where(z_scores > threshold)[0])))
        df_outlier = pd.DataFrame(threshold_list, columns = ['threshold', 'outlier_count'])
        df_outlier['pct'] = (df_outlier.outlier_count - df_outlier.outlier_count.shift(-1))/df_outlier.outlier_count*100
    plt.plot(df_outlier.threshold, df_outlier.outlier_count)
    best_treshold = round(df_outlier.iloc[df_outlier.pct.argmax(), 0],2)
    outlier_limit = int(df[col].dropna().mean() + (df[col].dropna().std()) * df_outlier.iloc[df_outlier.pct.argmax(), 0])
    percentile_threshold = stats.percentileofscore(df[col].dropna(), outlier_limit)
    plt.vlines(best_treshold, 0, df_outlier.outlier_count.max(), 
               colors="r", ls = ":"
              )
    plt.annotate("Zscore : {}\nValue : {}\nPercentile : {}".format(best_treshold, outlier_limit, 
                                                                   (np.round(percentile_threshold, 3), 
                                                                    np.round(100-percentile_threshold, 3))), 
                 (best_treshold, df_outlier.outlier_count.max()/2))
    #plt.show()
    if print_list:
        print(df_outlier)
    return (plt, df_outlier, best_treshold, outlier_limit, percentile_threshold)

def outlier_inspect(df, col, min_z=1, max_z = 5, step = 0.5, max_hist = None, bins = 50):
    fig = plt.figure(figsize=(20, 6))
    fig.suptitle(col, fontsize=16)
    plt.subplot(1,3,1)
    if max_hist == None:
        sns.distplot(df[col], kde=False, bins = 50)
    else :
        sns.distplot(df[df[col]<=max_hist][col], kde=False, bins = 50)
    plt.subplot(1,3,2)
    sns.boxplot(df[col])
    plt.subplot(1,3,3)
    z_score_inspect = outlier_zscore(df, col, min_z=min_z, max_z = max_z, step = step)
    plt.show()

# Load | Read Data

In [5]:
# 2-Load|Read Data
csv_path = "covtype.csv"
df = pd.read_csv(csv_path)
df = df.copy() 
# drop_columns = "id"
# df.head()
# df.shape
# df.columns= df.columns.str.lower().str.replace('&', '_').str.replace(' ', '_')
# df.nunique()
# df.info()
# df.shape
# df.isnull().sum()
# missing(df)
# df.drop(drop_columns, axis=1, inplace=True)
# df.shape
# df.describe().T
# df.columns

In [6]:
df.head()

Unnamed: 0,Elevation,Aspect,Slope,Horizontal_Distance_To_Hydrology,Vertical_Distance_To_Hydrology,Horizontal_Distance_To_Roadways,Hillshade_9am,Hillshade_Noon,Hillshade_3pm,Horizontal_Distance_To_Fire_Points,Wilderness_Area1,Wilderness_Area2,Wilderness_Area3,Wilderness_Area4,Soil_Type1,Soil_Type2,Soil_Type3,Soil_Type4,Soil_Type5,Soil_Type6,Soil_Type7,Soil_Type8,Soil_Type9,Soil_Type10,Soil_Type11,Soil_Type12,Soil_Type13,Soil_Type14,Soil_Type15,Soil_Type16,Soil_Type17,Soil_Type18,Soil_Type19,Soil_Type20,Soil_Type21,Soil_Type22,Soil_Type23,Soil_Type24,Soil_Type25,Soil_Type26,Soil_Type27,Soil_Type28,Soil_Type29,Soil_Type30,Soil_Type31,Soil_Type32,Soil_Type33,Soil_Type34,Soil_Type35,Soil_Type36,Soil_Type37,Soil_Type38,Soil_Type39,Soil_Type40,Cover_Type
0,2596,51,3,258,0,510,221,232,148,6279,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,5
1,2590,56,2,212,-6,390,220,235,151,6225,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,5
2,2804,139,9,268,65,3180,234,238,135,6121,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2
3,2785,155,18,242,118,3090,238,238,122,6211,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,2
4,2595,45,2,153,-1,391,220,234,150,6172,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,5


In [7]:
#df.shape

(581012, 55)

In [8]:
df.columns= df.columns.str.lower().str.replace('&', '_').str.replace(' ', '_')

In [9]:
df.nunique()

elevation                             1978
aspect                                 361
slope                                   67
horizontal_distance_to_hydrology       551
vertical_distance_to_hydrology         700
horizontal_distance_to_roadways       5785
hillshade_9am                          207
hillshade_noon                         185
hillshade_3pm                          255
horizontal_distance_to_fire_points    5827
wilderness_area1                         2
wilderness_area2                         2
wilderness_area3                         2
wilderness_area4                         2
soil_type1                               2
soil_type2                               2
soil_type3                               2
soil_type4                               2
soil_type5                               2
soil_type6                               2
soil_type7                               2
soil_type8                               2
soil_type9                               2
soil_type10

In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 581012 entries, 0 to 581011
Data columns (total 55 columns):
 #   Column                              Non-Null Count   Dtype
---  ------                              --------------   -----
 0   elevation                           581012 non-null  int64
 1   aspect                              581012 non-null  int64
 2   slope                               581012 non-null  int64
 3   horizontal_distance_to_hydrology    581012 non-null  int64
 4   vertical_distance_to_hydrology      581012 non-null  int64
 5   horizontal_distance_to_roadways     581012 non-null  int64
 6   hillshade_9am                       581012 non-null  int64
 7   hillshade_noon                      581012 non-null  int64
 8   hillshade_3pm                       581012 non-null  int64
 9   horizontal_distance_to_fire_points  581012 non-null  int64
 10  wilderness_area1                    581012 non-null  int64
 11  wilderness_area2                    581012 non-null 

In [12]:
df.isnull().sum()

elevation                             0
aspect                                0
slope                                 0
horizontal_distance_to_hydrology      0
vertical_distance_to_hydrology        0
horizontal_distance_to_roadways       0
hillshade_9am                         0
hillshade_noon                        0
hillshade_3pm                         0
horizontal_distance_to_fire_points    0
wilderness_area1                      0
wilderness_area2                      0
wilderness_area3                      0
wilderness_area4                      0
soil_type1                            0
soil_type2                            0
soil_type3                            0
soil_type4                            0
soil_type5                            0
soil_type6                            0
soil_type7                            0
soil_type8                            0
soil_type9                            0
soil_type10                           0
soil_type11                           0


In [13]:
missing(df)

Unnamed: 0,Missing_Number,Missing_Percent
elevation,0,0.0
soil_type28,0,0.0
soil_type17,0,0.0
soil_type18,0,0.0
soil_type19,0,0.0
soil_type20,0,0.0
soil_type21,0,0.0
soil_type22,0,0.0
soil_type23,0,0.0
soil_type24,0,0.0


In [14]:
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
elevation,581012.0,2959.365,279.985,1859.0,2809.0,2996.0,3163.0,3858.0
aspect,581012.0,155.657,111.914,0.0,58.0,127.0,260.0,360.0
slope,581012.0,14.104,7.488,0.0,9.0,13.0,18.0,66.0
horizontal_distance_to_hydrology,581012.0,269.428,212.549,0.0,108.0,218.0,384.0,1397.0
vertical_distance_to_hydrology,581012.0,46.419,58.295,-173.0,7.0,30.0,69.0,601.0
horizontal_distance_to_roadways,581012.0,2350.147,1559.255,0.0,1106.0,1997.0,3328.0,7117.0
hillshade_9am,581012.0,212.146,26.77,0.0,198.0,218.0,231.0,254.0
hillshade_noon,581012.0,223.319,19.769,0.0,213.0,226.0,237.0,254.0
hillshade_3pm,581012.0,142.528,38.275,0.0,119.0,143.0,168.0,254.0
horizontal_distance_to_fire_points,581012.0,1980.291,1324.195,0.0,1024.0,1710.0,2550.0,7173.0


In [15]:
df.columns

Index(['elevation', 'aspect', 'slope', 'horizontal_distance_to_hydrology',
       'vertical_distance_to_hydrology', 'horizontal_distance_to_roadways',
       'hillshade_9am', 'hillshade_noon', 'hillshade_3pm',
       'horizontal_distance_to_fire_points', 'wilderness_area1',
       'wilderness_area2', 'wilderness_area3', 'wilderness_area4',
       'soil_type1', 'soil_type2', 'soil_type3', 'soil_type4', 'soil_type5',
       'soil_type6', 'soil_type7', 'soil_type8', 'soil_type9', 'soil_type10',
       'soil_type11', 'soil_type12', 'soil_type13', 'soil_type14',
       'soil_type15', 'soil_type16', 'soil_type17', 'soil_type18',
       'soil_type19', 'soil_type20', 'soil_type21', 'soil_type22',
       'soil_type23', 'soil_type24', 'soil_type25', 'soil_type26',
       'soil_type27', 'soil_type28', 'soil_type29', 'soil_type30',
       'soil_type31', 'soil_type32', 'soil_type33', 'soil_type34',
       'soil_type35', 'soil_type36', 'soil_type37', 'soil_type38',
       'soil_type39', 'soil_type40

In [15]:
import missingno as msno 

In [17]:
#msno.bar(df)

In [18]:
#msno.matrix(df);

In [19]:
#msno.matrix(df_female);

In [20]:
# drop_columns = []
# drop_columns.append("subjectid")

In [21]:
#unneccesseray_data = list(df.describe(include=object).columns.drop('gender'))

In [22]:
# drop_columns.extend(unneccesseray_data )

In [23]:
# drop_columns.append('age')

In [24]:
# df.drop(drop_columns, axis=1, inplace=True)

In [25]:
# df.shape

In [16]:
df['cover_type'].value_counts()

2    283301
1    211840
3     35754
7     20510
6     17367
5      9493
4      2747
Name: cover_type, dtype: int64

# Exploratory Data Analysis and Visualization

## Features | Target

In [17]:
df.duplicated(subset=None, keep='first').sum()

0

In [18]:
# 3-Target Examination
target = "cover_type"

# df.duplicated(subset=None, keep='first').sum()
df.drop_duplicates(keep = 'first', inplace = True)

# df = df.dropna()

X_columns = df.drop(target, axis=1).columns
X_categorical = df.drop(target, axis=1).select_dtypes('object')
X_numerical = df.drop(target, axis=1).select_dtypes('number').astype('float64')

# df[target].value_counts()
# X_columns
# X_numerical.columns
# X_categorical.columns
# X_numerical.columns.values

In [19]:
df[target].value_counts()

2    283301
1    211840
3     35754
7     20510
6     17367
5      9493
4      2747
Name: cover_type, dtype: int64

## Numerical Features

In [None]:
# index = 0
# plt.figure(figsize=(20,20))
# for feature in X_numerical.columns:
#     if feature != target:
#         index += 1
#         plt.subplot(5,4,index)
#         sns.boxplot(x=target,y=feature,data=df);

In [20]:
df.corr().style.background_gradient(cmap='RdPu')

Unnamed: 0,elevation,aspect,slope,horizontal_distance_to_hydrology,vertical_distance_to_hydrology,horizontal_distance_to_roadways,hillshade_9am,hillshade_noon,hillshade_3pm,horizontal_distance_to_fire_points,wilderness_area1,wilderness_area2,wilderness_area3,wilderness_area4,soil_type1,soil_type2,soil_type3,soil_type4,soil_type5,soil_type6,soil_type7,soil_type8,soil_type9,soil_type10,soil_type11,soil_type12,soil_type13,soil_type14,soil_type15,soil_type16,soil_type17,soil_type18,soil_type19,soil_type20,soil_type21,soil_type22,soil_type23,soil_type24,soil_type25,soil_type26,soil_type27,soil_type28,soil_type29,soil_type30,soil_type31,soil_type32,soil_type33,soil_type34,soil_type35,soil_type36,soil_type37,soil_type38,soil_type39,soil_type40,cover_type
elevation,1.0,0.015735,-0.242697,0.306229,0.093306,0.365559,0.112179,0.205887,0.059148,0.148022,0.131838,0.238164,0.06655,-0.619374,-0.204512,-0.187677,-0.182463,-0.183521,-0.150376,-0.214606,-0.002252,-0.003021,-0.060915,-0.428746,-0.134227,-0.118905,-0.043984,-0.080825,-0.007153,-0.059446,-0.111028,-0.081811,0.033144,-0.043128,0.017557,0.158959,0.124356,0.053582,0.028753,-0.016657,0.035254,-0.02927,0.074327,-0.026667,0.070405,0.167077,0.070633,0.011731,0.083005,0.021107,0.035433,0.217179,0.193595,0.212612,-0.269554
aspect,0.015735,1.0,0.078728,0.017376,0.070305,0.025121,-0.579273,0.336103,0.646944,-0.109172,-0.140123,0.055988,0.074904,0.082687,-0.007574,-0.005649,-0.00273,0.017212,0.008938,0.010766,-0.005052,-0.003366,-0.0208,0.049835,-0.064344,-0.070209,0.054544,0.007597,-0.00266,0.007846,-0.000168,-0.028353,-0.003635,-0.02944,0.032998,0.021578,0.013676,0.018164,-0.003265,-0.010661,0.011328,0.027535,-0.062181,-0.028922,0.001763,0.056233,0.019163,0.010861,-0.021991,0.002281,-0.020398,0.017706,0.008294,-0.005866,0.01708
slope,-0.242697,0.078728,1.0,-0.010607,0.274976,-0.215914,-0.327199,-0.526911,-0.175854,-0.185662,-0.234576,-0.036253,0.125663,0.255503,0.107847,-0.018553,0.125497,0.131847,0.072311,0.003673,-0.015661,-0.023359,-0.032752,0.244037,-0.050894,-0.1693,0.192423,0.000228,0.001081,-0.034791,-0.040208,-0.045851,-0.083743,-0.077582,-0.025461,-0.053396,-0.207397,0.082434,0.026364,-0.021449,0.043695,0.067052,-0.082941,0.075864,-0.03461,-0.133504,0.208942,-0.011002,-0.022228,0.002918,0.007848,-0.072208,0.093602,0.025637,0.148285
horizontal_distance_to_hydrology,0.306229,0.017376,-0.010607,1.0,0.606236,0.07203,-0.027088,0.04679,0.05233,0.051874,-0.097124,0.055726,0.122028,-0.100433,-0.035096,-0.011569,-0.041211,-0.049071,-0.00937,-0.012916,0.004751,-0.000795,-0.021935,-0.071653,0.001399,0.014628,-0.002032,-0.038478,-0.002667,-0.067448,-0.071435,-0.01334,-0.043236,-0.078088,-0.039953,-0.051424,-0.132244,0.021927,0.016099,0.013408,0.052384,0.02621,-0.001025,-0.04996,0.073658,0.127217,0.101195,0.070268,-0.005231,0.033421,-0.006802,0.043031,0.031922,0.14702,-0.020317
vertical_distance_to_hydrology,0.093306,0.070305,0.274976,0.606236,1.0,-0.046372,-0.166333,-0.110957,0.034902,-0.069913,-0.18071,-0.008709,0.146839,0.077792,0.015275,0.008954,0.008863,0.025066,0.026772,0.046259,-0.008485,-0.012915,-0.028476,0.055154,-0.02087,-0.044526,0.083482,-0.024281,-0.001744,-0.050909,-0.054191,-0.031692,-0.055635,-0.076727,-0.026116,-0.075679,-0.180098,0.037066,-0.013471,-0.011212,0.067086,0.071672,-0.07586,-0.011901,0.033609,0.039762,0.167091,0.060274,-0.006092,0.012955,-0.00752,-0.008629,0.043859,0.179006,0.081664
horizontal_distance_to_roadways,0.365559,0.025121,-0.215914,0.07203,-0.046372,1.0,0.034349,0.189461,0.106119,0.33158,0.453913,-0.200411,-0.232933,-0.270349,-0.083585,-0.088026,-0.084988,-0.088524,-0.061607,-0.108328,0.020107,0.025805,-0.045813,-0.182955,-0.099293,0.054196,-0.054968,-0.033945,-0.003144,0.018083,-0.051825,-0.051243,0.068758,0.056595,-0.01489,0.046979,-0.007067,-0.032451,-0.034842,0.002521,0.003866,-0.032749,0.306324,0.077091,-0.05884,-0.089019,-0.082779,0.00639,-0.003,0.00755,0.016313,0.079778,0.033762,0.016052,-0.15345
hillshade_9am,0.112179,-0.579273,-0.327199,-0.027088,-0.166333,0.034349,1.0,0.010037,-0.780296,0.132669,0.201299,-0.006181,-0.100565,-0.200282,-0.000937,0.036253,0.039648,0.023812,-0.046514,-0.005665,0.003571,0.005,0.021741,-0.223782,0.048371,0.092364,-0.07339,-0.010719,-0.000522,-0.00659,0.0047,0.031293,0.017103,0.024811,-0.014162,0.000252,0.036234,-0.112379,0.032783,0.027388,0.001638,-0.091435,0.081499,0.104003,-0.035114,0.006494,-0.064381,0.007154,0.02787,0.007865,0.010332,0.015108,-0.02962,-1.6e-05,-0.035415
hillshade_noon,0.205887,0.336103,-0.526911,0.04679,-0.110957,0.189461,0.010037,1.0,0.594274,0.057329,0.028728,0.042392,0.048646,-0.195733,-0.052561,0.04325,0.002702,0.084397,-0.062044,-0.010497,0.005282,0.00952,0.005446,-0.245854,-0.011993,0.058469,0.061922,0.000969,-0.002872,0.015544,0.028664,0.01517,0.03714,0.015826,0.029727,0.032096,0.118746,-0.128597,0.007276,0.04122,0.019941,-0.004998,-0.017877,-0.030526,-9.5e-05,0.125395,-0.086164,0.043061,0.005863,0.016239,-0.022707,0.042952,-0.071961,-0.040176,-0.096426
hillshade_3pm,0.059148,0.646944,-0.175854,0.05233,0.034902,0.106119,-0.780296,0.594274,1.0,-0.047981,-0.115155,0.034707,0.090757,0.01886,-0.050157,-0.005276,-0.060554,-0.00467,-0.0069,-0.000556,0.001852,0.003576,-0.010428,0.019923,-0.03564,-0.020883,0.052648,0.009826,-0.00112,0.019312,0.016858,-0.011445,0.016273,0.000494,0.031423,0.027286,0.063686,0.020953,-0.029094,0.002145,0.000383,0.059661,-0.059882,-0.11738,0.040475,0.083066,-0.024393,0.017757,-0.016482,0.00133,-0.022064,0.022187,-0.02904,-0.024254,-0.04829
horizontal_distance_to_fire_points,0.148022,-0.109172,-0.185662,0.051874,-0.069913,0.33158,0.132669,0.057329,-0.047981,1.0,0.380568,0.027473,-0.27751,-0.236548,-0.073607,-0.081716,-0.07634,-0.076478,-0.051845,-0.087305,0.02839,0.032796,-0.036639,-0.175974,-0.042799,0.26172,-0.092053,-0.032645,-0.002541,0.073795,-0.021689,0.107228,0.006165,0.108575,-0.024113,-0.024772,-0.025421,0.007914,0.036232,0.021583,-0.00362,-0.018298,0.215194,0.054713,-0.066258,-0.089977,-0.059067,-0.035067,-8.1e-05,-0.010595,0.00418,-0.01974,-0.003301,0.008915,-0.108936


In [21]:
df.shape

(581012, 55)

In [22]:
def correlation(dataset, threshold):
    col_corr = set() # Set of all the names of deleted columns
    corr_matrix = dataset.corr()
    for i in range(len(corr_matrix.columns)):
        for j in range(i):
            if (corr_matrix.iloc[i, j] >= threshold) and (corr_matrix.columns[j] not in col_corr):
                colname = corr_matrix.columns[i] # getting the name of column
                col_corr.add(colname)
                if colname in dataset.columns:
                    del dataset[colname] # deleting the column from the dataset

    print(dataset)

In [23]:
correlation(df, 0.9)

        elevation  aspect  slope  horizontal_distance_to_hydrology  \
0            2596      51      3                               258   
1            2590      56      2                               212   
2            2804     139      9                               268   
3            2785     155     18                               242   
4            2595      45      2                               153   
...           ...     ...    ...                               ...   
581007       2396     153     20                                85   
581008       2391     152     19                                67   
581009       2386     159     17                                60   
581010       2384     170     15                                60   
581011       2383     165     13                                60   

        vertical_distance_to_hydrology  horizontal_distance_to_roadways  \
0                                    0                              510   
1        

In [None]:
df.corr().style.background_gradient(cmap='RdPu')

In [32]:
df.shape

(581012, 55)

In [30]:
def outlier_zscore(df, col, min_z=1, max_z = 5, step = 0.1, print_list = False):
    z_scores = stats.zscore(df[col].dropna())
    threshold_list = []
    for threshold in np.arange(min_z, max_z, step):
        threshold_list.append((threshold, len(np.where(z_scores > threshold)[0])))
        df_outlier = pd.DataFrame(threshold_list, columns = ['threshold', 'outlier_count'])
        df_outlier['pct'] = (df_outlier.outlier_count - df_outlier.outlier_count.shift(-1))/df_outlier.outlier_count*100
    plt.plot(df_outlier.threshold, df_outlier.outlier_count)
    best_treshold = round(df_outlier.iloc[df_outlier.pct.argmax(), 0],2)
    outlier_limit = int(df[col].dropna().mean() + (df[col].dropna().std()) * df_outlier.iloc[df_outlier.pct.argmax(), 0])
    percentile_threshold = stats.percentileofscore(df[col].dropna(), outlier_limit)
    plt.vlines(best_treshold, 0, df_outlier.outlier_count.max(), 
               colors="r", ls = ":"
              )
    plt.annotate("Zscore : {}\nValue : {}\nPercentile : {}".format(best_treshold, outlier_limit, 
                                                                   (np.round(percentile_threshold, 3), 
                                                                    np.round(100-percentile_threshold, 3))), 
                 (best_treshold, df_outlier.outlier_count.max()/2))
    #plt.show()
    if print_list:
        print(df_outlier)
    return (plt, df_outlier, best_treshold, outlier_limit, percentile_threshold)

def outlier_inspect(df, col, min_z=1, max_z = 5, step = 0.5, max_hist = None, bins = 50):
    fig = plt.figure(figsize=(20, 6))
    fig.suptitle(col, fontsize=16)
    plt.subplot(1,3,1)
    if max_hist == None:
        sns.distplot(df[col], kde=False, bins = 50)
    else :
        sns.distplot(df[df[col]<=max_hist][col], kde=False, bins = 50)
    plt.subplot(1,3,2)
    sns.boxplot(df[col])
    plt.subplot(1,3,3)
    z_score_inspect = outlier_zscore(df, col, min_z=min_z, max_z = max_z, step = step)
    plt.show()
    
    
def outlier_inspect_neg(df, col, min_z=-5, max_z = 1, step = -0.5, max_hist = None, bins = 50):
    fig = plt.figure(figsize=(20, 6))
    fig.suptitle(col, fontsize=16)
    plt.subplot(1,3,1)
    if max_hist == None:
        sns.distplot(df[col], kde=False, bins = 50)
    else :
        sns.distplot(df[df[col]<=max_hist][col], kde=False, bins = 50)
    plt.subplot(1,3,2)
    sns.boxplot(df[col])
    plt.subplot(1,3,3)
    z_score_inspect = outlier_zscore(df, col, min_z=min_z, max_z = max_z, step = step)
    plt.show()

In [32]:
for i in df.columns[:10]:
    outlier_inspect(df, i)

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

In [36]:
z_scores = stats.zscore(df)
abs_z_scores = np.abs(z_scores)
filtered_entries = (abs_z_scores < 2.5).all(axis=1)
df2 = df[filtered_entries]

print(df2)

        elevation  aspect  slope  horizontal_distance_to_hydrology  \
27           2962     148     16                               323   
35           2900      45     19                               242   
61           2952     107     11                                42   
67           2919      13     13                                90   
72           2893     114     16                               108   
...           ...     ...    ...                               ...   
488234       3274     193     16                               331   
488235       3277     188     18                               330   
488236       3277     175     18                               330   
488237       3275     168     19                               331   
488238       3273     163     18                               335   

        vertical_distance_to_hydrology  horizontal_distance_to_roadways  \
27                                  23                             5916   
35       

In [27]:
for i in new_df.columns[:10]:
    outlier_inspect(new_df, i)

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

In [33]:
def detect_outliers(df, col_name):
    """
    This function detects outliers based on 3 time IQR and returns the number of
    lower and upper limit and number of outliers respectively
    """
    first_quartile = np.percentile(np.array(df[col_name].tolist()), 25)
    third_quartile = np.percentile(np.array(df[col_name].tolist()), 75)
    IQR = third_quartile - first_quartile
    upper_limit = third_quartile + (3 * IQR)
    lower_limit = first_quartile - (3 * IQR)
    outlier_count = 0
    for value in df[col_name].tolist():
        if (value < lower_limit) | (value > upper_limit):
            outlier_count += 1
    return lower_limit, upper_limit, outlier_count

In [34]:
for col in df:
    if detect_outliers(df, col)[2] > 0:
        print("There are {} outliers in {}".format(detect_outliers(df, col)[2], col))

There are 275 outliers in slope
There are 414 outliers in horizontal_distance_to_hydrology
There are 5339 outliers in vertical_distance_to_hydrology
There are 1027 outliers in hillshade_9am
There are 1191 outliers in hillshade_noon
There are 10 outliers in horizontal_distance_to_fire_points
There are 29884 outliers in wilderness_area2
There are 36968 outliers in wilderness_area4
There are 3031 outliers in soil_type1
There are 7525 outliers in soil_type2
There are 4823 outliers in soil_type3
There are 12396 outliers in soil_type4
There are 1597 outliers in soil_type5
There are 6575 outliers in soil_type6
There are 105 outliers in soil_type7
There are 179 outliers in soil_type8
There are 1147 outliers in soil_type9
There are 32634 outliers in soil_type10
There are 12410 outliers in soil_type11
There are 29971 outliers in soil_type12
There are 17431 outliers in soil_type13
There are 599 outliers in soil_type14
There are 3 outliers in soil_type15
There are 2845 outliers in soil_type16
Ther

In [37]:
for col in df.columns:
    df3 = df[(df[col] > detect_outliers(df, col)[0]) & (df[col] < detect_outliers(df, col)[1])]

In [40]:
for i in df[:10]:
    df1 = df[~df[i].transform(lambda x: abs(x-x.mean()) > 1.96*x.std()).values]

In [38]:
df2.shape

(98233, 55)

In [39]:
df3.shape

(533642, 55)

# Model Selection

## Train | Test Split & Scaling

In [33]:
# 10-Train|Test Split, Dummy 

# # Before dummy: 
# make_dtype_object = df[['categorical1','categorical2']].astype('object')

X_columns_ = df.drop(target, axis=1).columns
X_categorical_ = df.drop(target, axis=1).select_dtypes('object')
X_numerical_ = df.drop(target, axis=1).select_dtypes('number').astype('float64')

###############################################################################

if (df.dtypes==object).any():
    dummied = pd.get_dummies(X_categorical_, drop_first=True)
    X = pd.concat([X_numerical_, dummied[dummied.columns]], axis=1)
    
else:
    X = df.drop(target, axis=1).astype('float64')
try:
    if (df[target].dtypes==object).any():
        y = pd.get_dummies(df[target], drop_first=True)
    
except:
    y = df[target]

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=0.20, 
                                                    random_state=42)

###############################################################################

# # 11-MinMax Scaling
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# # 11-Standart Scaling
# from sklearn.preprocessing import StandardScaler
# scaler = StandardScaler()
# X_train_scaled = scaler.fit_transform(X_train)
# X_test_scaled = scaler.transform(X_test)

###############################################################################

In [34]:
shape_control()

df.shape: (581012, 55)
X.shape: (581012, 54)
y.shape: (581012,)
X_train.shape: (464809, 54)
y_train.shape: (464809,)
X_test.shape: (116203, 54)
y_test.shape: (116203,)



## Implement DT and Evaluate¶

In [None]:
## Cross Validation 
from sklearn.model_selection import cross_validate
from sklearn.metrics import accuracy_score, f1_score, recall_score, precision_score
from sklearn.metrics import make_scorer

cv_model = DecisionTreeClassifier(random_state=42)
scores = cross_validate(cv_model, X_train, y_train, scoring = ["accuracy", "precision_macro", "recall_macro", "f1_macro"], cv = 10)
df_scores = pd.DataFrame(scores, index = range(1, 11))

df_scores.mean()[2:]

In [None]:
# Simple Classifier
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier(random_state=42).fit(X_train_scaled, y_train)
y_test_pred = model.predict(X_test_scaled)
log_acc, log_recall = calc_predict()
get_report()

In [None]:
feature_importances()

In [None]:
feature_importances_bar()

## Implement Logistic Regression and Evaluate¶

In [None]:
## Model Evaluate¶

# 1-Logistic Regression
params = {"penalty" : ["l1", "l2", "elasticnet"],
          "l1_ratio" : np.linspace(0, 1, 20),
          "C" : np.logspace(0, 10, 20)}
model = GridSearchCV(LogisticRegression(random_state=42), 
                     params, 
                     cv=10).fit(X_train_scaled, y_train)

y_test_pred = model.predict(X_test_scaled)
log_acc, log_recall = calc_predict()
get_report()
# train_control_table()
# test_control_table()
# feature_importances()
# feature_importances_bar()
log_acc = accuracy_score(y_test, y_test_pred)
log_recall = recall_score(y_test, y_test_pred)

# # Model tunning
# tuned_model = LogisticRegression(penalty = penalty, 
#                                C = C, 
#                                l1_ratio = l1_ratio, 
#                                solver='saga', 
#                                max_iter=5000).fit(X_train_scaled, y_train)
# y_test_pred = tuned_model.predict(X_test_scaled)

## ROC (Receiver Operating Curve) and AUC (Area Under Curve)

In [None]:
#print(model)
from sklearn.metrics import plot_roc_curve, plot_precision_recall_curve
plot_roc_curve(model, X_train_scaled, y_train);
plot_precision_recall_curve(model, X_train_scaled, y_train);

plot_roc_curve(model, X_test_scaled, y_test);
plot_precision_recall_curve(model, X_test_scaled, y_test);

# Implement KNN and Evaluate

In [None]:
## Model Evaluate¶

# 1-KNN Classification
params = {"n_neighbors": np.arange(1, 30)}
model = GridSearchCV(KNeighborsClassifier(), 
                    params, 
                    cv=10).fit(X_train_scaled, y_train)
y_test_pred = model.predict(X_test_scaled)
# knn_acc, knn_recall = calc_predict()
get_report()

In [None]:
#print(model)
from sklearn.metrics import plot_roc_curve, plot_precision_recall_curve
plot_roc_curve(model, X_train_scaled, y_train);
plot_precision_recall_curve(model, X_train_scaled, y_train);

plot_roc_curve(model, X_test_scaled, y_test);
plot_precision_recall_curve(model, X_test_scaled, y_test);

In [None]:
# KNN Classification
params = {"n_neighbors": np.arange(1, 30), 
          "p": [1, 2]}
model = GridSearchCV(KNeighborsClassifier(), 
                     params, 
                     cv=10).fit(X_train_scaled, y_train)
y_test_pred = model.predict(X_test_scaled)
# knn_acc, knn_recall = calc_predict()
get_report()

In [None]:
#print(model)
from sklearn.metrics import plot_roc_curve, plot_precision_recall_curve
plot_roc_curve(model, X_train_scaled, y_train);
plot_precision_recall_curve(model, X_train_scaled, y_train);

plot_roc_curve(model, X_test_scaled, y_test);
plot_precision_recall_curve(model, X_test_scaled, y_test);

In [None]:
# KNN Classification
params = {"n_neighbors": np.arange(1, 50), 
          "p": [1,2], 
          "weights": ['uniform', "distance"]}
model = GridSearchCV(KNeighborsClassifier(), 
                     params, 
                     cv=10).fit(X_train_scaled, y_train)
y_test_pred = model.predict(X_test_scaled)
knn_acc, knn_recall = calc_predict()
get_report()

In [None]:
#print(model)
from sklearn.metrics import plot_roc_curve, plot_precision_recall_curve
plot_roc_curve(model, X_train_scaled, y_train);
plot_precision_recall_curve(model, X_train_scaled, y_train);

plot_roc_curve(model, X_test_scaled, y_test);
plot_precision_recall_curve(model, X_test_scaled, y_test);

## Implement SVM and Evaluate

In [None]:
# SVM Classification
from sklearn.model_selection import GridSearchCV
params = {'C': [0.1,1, 10, 100, 1000],
          'gamma': ["scale", "auto", 1,0.1,0.01,0.001,0.0001],
          'kernel': ['rbf', 'linear', 'poly']}
model = GridSearchCV(SVC(random_state=42), 
                     params, 
                     verbose=3, 
                     refit=True, 
                     cv=10).fit(X_train_scaled, y_train)
y_test_pred = model.predict(X_test_scaled)
svm_acc, svm_recall = calc_predict()
get_report()

In [None]:
#print(model)
from sklearn.metrics import plot_roc_curve, plot_precision_recall_curve
plot_roc_curve(model, X_train_scaled, y_train);
plot_precision_recall_curve(model, X_train_scaled, y_train);

plot_roc_curve(model, X_test_scaled, y_test);
plot_precision_recall_curve(model, X_test_scaled, y_test);

## Implement XGBoost and Evaluate

In [None]:
shape_control()

In [None]:
from xgboost import XGBClassifier
params = {"n_estimators":[100, 300],
          "max_depth":[3,5,6], 
          "learning_rate": [0.1, 0.3],
          "subsample":[0.5, 1],
          "colsample_bytree":[0.5, 1]}
model = GridSearchCV(XGBClassifier(random_state=101), 
                     params, 
                     scoring="f1", 
                     verbose=2, 
                     n_jobs=-1,
                     cv=5).fit(X_train_scaled, y_train)
y_test_pred = model.predict(X_test_scaled)
xg_acc, xg_recall = calc_predict()
get_report()

In [None]:
#print(model)
from sklearn.metrics import plot_roc_curve, plot_precision_recall_curve
plot_roc_curve(model, X_train_scaled, y_train);
plot_precision_recall_curve(model, X_train_scaled, y_train);

plot_roc_curve(model, X_test_scaled, y_test);
plot_precision_recall_curve(model, X_test_scaled, y_test);

In [36]:
# !pip install lightgbm
# conda install -c conda-forge lightgbm
from lightgbm import LGBMClassifier
params = {'n_estimators': [100, 500, 1000, 2000],
          'subsample': [0.6, 0.8, 1.0],
          'max_depth': [3, 4, 5,6],
          'learning_rate': [0.1,0.01,0.02,0.05],
          "min_child_samples": [5,10,20]}
model = GridSearchCV(LGBMClassifier(random_state=101), 
                     params, 
                     scoring="f1", 
                     verbose=2, 
                     n_jobs=-1,
                     cv=5).fit(X_train_scaled, y_train)
y_test_pred = model.predict(X_test_scaled)
lgbm_acc, lgbm_recall = calc_predict()
get_report()



Model: <bound method LGBMModel.get_params of LGBMClassifier(random_state=101)> 


Train:
rmse: 0.6951216959855905
accuracy: 0.8640581400101978


confusion_matrix:

 [[139487  28762      4      9    108     22    891]
 [ 22086 202136   1055     21    581    857     65]
 [     1    991  26456     91     12   1082      0]
 [     0      2    111   2063      0     45      0]
 [    38   2101    101      0   5244     14      0]
 [    13    837   1608     44      3  11373      0]
 [  1576     55      0      0      1      0  14863]] 

classification_report:

               precision    recall  f1-score   support

           1       0.85      0.82      0.84    169283
           2       0.86      0.89      0.88    226801
           3       0.90      0.92      0.91     28633
           4       0.93      0.93      0.93      2221
           5       0.88      0.70      0.78      7498
           6       0.85      0.82      0.83     13878
           7       0.94      0.90      0.92     16495

    acc

## Implement RandomForest and Evaluate

In [38]:
# RandomForest Classification
params = {"max_depth": [2,5,8,10],
          "max_features": [2,5,8],
          "n_estimators": [10,500,1000],
          "min_samples_split": [2,5,10]}
model = GridSearchCV(RandomForestClassifier(random_state=42), 
                     params, 
                     n_jobs=-1, 
                     verbose=2, 
                     refit=True,
                     cv=10).fit(X_train_scaled, y_train)
y_test_pred = model.predict(X_test_scaled)
rf_acc, rf_recall = calc_predict()
get_report()



Model: <bound method BaseEstimator.get_params of RandomForestClassifier(random_state=101)> 


Train:
rmse: 0.0
accuracy: 1.0


confusion_matrix:

 [[169283      0      0      0      0      0      0]
 [     0 226801      0      0      0      0      0]
 [     0      0  28633      0      0      0      0]
 [     0      0      0   2221      0      0      0]
 [     0      0      0      0   7498      0      0]
 [     0      0      0      0      0  13878      0]
 [     0      0      0      0      0      0  16495]] 

classification_report:

               precision    recall  f1-score   support

           1       1.00      1.00      1.00    169283
           2       1.00      1.00      1.00    226801
           3       1.00      1.00      1.00     28633
           4       1.00      1.00      1.00      2221
           5       1.00      1.00      1.00      7498
           6       1.00      1.00      1.00     13878
           7       1.00      1.00      1.00     16495

    accuracy             

In [None]:
from sklearn.metrics import plot_roc_curve, plot_precision_recall_curve
model = gridCV_model
plot_roc_curve(model, X_train_scaled, y_train);
plot_precision_recall_curve(model, X_train_scaled, y_train);

model = gridCV_model
plot_roc_curve(model, X_test_scaled, y_test);
plot_precision_recall_curve(model, X_test_scaled, y_test);

# Data Preprocessing

# Visually compare models based on your chosen metric

# Chose best model and make a random prediction

In [None]:
compare = pd.DataFrame({"Model": ["LR", "KNN", "SVM", "DT", "RF"],
                        "Accuracy": [dt_acc, log_acc, knn_acc, svm_acc, xg_acc, rf_acc],
                        "Recall": [dt_recall, log_recall, knn_recall, svm_recall, xg_recall, rf_recall]})

def labels(ax):
    for p in ax.patches:
        width = p.get_width()                        # get bar length
        ax.text(width,                               # set the text at 1 unit right of the bar
                p.get_y() + p.get_height() / 2,      # get Y coordinate + X coordinate / 2
                '{:1.2f}'.format(width),             # set variable to display, 2 decimals
                ha='left',                         # horizontal alignment
                va='center')                       # vertical alignment
    
plt.figure(figsize=(14,10))
plt.subplot(211)
compare = compare.sort_values(by="Accuracy", ascending=False)
ax=sns.barplot(x="Accuracy", y="Model", data=compare, palette="Blues_d")
labels(ax)

plt.subplot(212)
compare = compare.sort_values(by="Recall", ascending=False)
ax=sns.barplot(x="Recall", y="Model", data=compare, palette="Blues_d")
labels(ax)
plt.show()