# Baseline Models

Here, I will fit several baseline models to the data. After splitting the data into train and test sets, I will fit a range of model types to see which perform best. The strongest will be tuned later. Accuracy is my deciding metric, but precision and recall will show which classes I'm having trouble classifying and where I can improve.
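
The `get_metrics` helper imported from `scripts` is not shown in this notebook. As a rough sketch of what such a helper might look like, the function below (hypothetically named `get_metrics_sketch`) keeps the `(y_test, X_test, model)` argument order used in the cells that follow and returns the four columns that appear in the final results table. The weighted averaging is an assumption: it is consistent with recall equalling accuracy in every row of that table, but the real helper may differ.

```python
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score


def get_metrics_sketch(y_true, X, model, average="weighted"):
    """Hypothetical stand-in for scripts.get_metrics: score an already fitted model."""
    y_pred = model.predict(X)
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        # zero_division=0 avoids warnings when a class is never predicted
        "f1": f1_score(y_true, y_pred, average=average, zero_division=0),
        "precision": precision_score(y_true, y_pred, average=average, zero_division=0),
        "recall": recall_score(y_true, y_pred, average=average, zero_division=0),
    }


# Toy check with a model that always predicts the most frequent class
X_toy = [[0], [1], [2], [3]]
y_toy = ["functional", "functional", "functional", "non functional"]
dummy = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)
toy_metrics = get_metrics_sketch(y_toy, X_toy, dummy)
print(toy_metrics)
```

With weighted averaging, recall is the class-frequency-weighted mean of per-class recall, which works out to the same number as accuracy.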

In [1]:
import warnings
warnings.filterwarnings("ignore")

In [2]:
import pandas as pd
from scripts import get_metrics
from sklearn.model_selection import train_test_split

In [3]:
df = pd.read_csv('../data/cleaned_data.csv', index_col='id')
df

id,status_group,longitude,latitude,population,construction_year,funder_communal standpipe,funder_communal standpipe multiple,funder_hand pump,funder_improved spring,funder_other,...,source_other,source_rainwater harvesting,source_river,source_shallow well,source_spring,waterpoint_type_communal standpipe,waterpoint_type_communal standpipe multiple,waterpoint_type_hand pump,waterpoint_type_improved spring,waterpoint_type_other
69572,functional,0.496455,0.168353,0.003541,0.735849,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0
8776,functional,0.474167,0.892122,0.009148,0.943396,1.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
34310,functional,0.731374,0.734967,0.008164,0.924528,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
67743,non functional,0.826875,0.046394,0.001869,0.490566,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
19728,functional,0.141899,0.922364,0.013692,0.852830,1.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
60739,functional,0.704287,0.788246,0.004066,0.735849,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0
27263,functional,0.525501,0.242120,0.001803,0.679245,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
37057,functional,0.410685,0.272182,0.003836,0.924528,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
31282,functional,0.582432,0.494872,0.009148,0.841509,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0


In [4]:
y = df['status_group']
X = df.drop(['status_group'], axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=212)
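
The split above uses scikit-learn's default 75/25 proportions without stratification. Because the EDA found imbalanced classes, one optional variation is to pass `stratify=` so that both halves keep the same class shares. The sketch below uses made-up labels in a 54:39:7 ratio; it is not what this notebook ran, and the toy `features` and `labels` are assumptions for illustration only.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Illustrative labels in a 54:39:7 ratio (not the real data)
labels = ["functional"] * 1080 + ["non functional"] * 780 + ["needs repair"] * 140
features = [[i] for i in range(len(labels))]

# stratify=labels keeps the class proportions the same in train and test
X_tr, X_te, y_tr, y_te = train_test_split(
    features, labels, stratify=labels, random_state=212
)

test_shares = {cls: n / len(y_te) for cls, n in Counter(y_te).items()}
print(test_shares)
```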

In [5]:
models = []

## Class Imbalance

As noted in the EDA, the classes are imbalanced: roughly 54:39:7 for functional : non functional : needs repair. To address this, I will use SMOTETomek from the imbalanced-learn library, which combines SMOTE over-sampling with Tomek links under-sampling.

In [6]:
#from imblearn.combine import SMOTETomek

In [7]:
#resampler = SMOTETomek(random_state=42)
#X_train_resampled, y_train_resampled = resampler.fit_resample(X_train, y_train)

In [8]:
#pd.DataFrame(y_train_resampled)[0].value_counts(normalize=True)

After testing this, I found that under- and over-sampling reduce accuracy. They improved precision and recall, and therefore the F1 scores, but those aren't my success metrics for this project. The competition uses accuracy as its deciding metric, so I won't use these methods in my final models (the resampling cells above are left commented out).
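
To illustrate why accuracy and recall can pull in different directions on imbalanced data, here is a small pure-Python sketch. It uses made-up labels in the 54:39:7 ratio from the EDA and a trivial "always predict the majority class" rule, which is an assumption for illustration and not one of the models in this notebook.

```python
from collections import Counter

# Illustrative labels in the 54:39:7 ratio (not the real data)
labels = ["functional"] * 54 + ["non functional"] * 39 + ["needs repair"] * 7

# Trivial baseline: always predict the most common class
class_counts = Counter(labels)
majority_label, _ = class_counts.most_common(1)[0]
predictions = [majority_label] * len(labels)

# Accuracy: share of all rows predicted correctly
accuracy = sum(p == t for p, t in zip(predictions, labels)) / len(labels)

# Per-class recall: share of each true class that was predicted correctly
per_class_recall = {
    cls: sum(p == t for p, t in zip(predictions, labels) if t == cls) / n
    for cls, n in class_counts.items()
}
macro_recall = sum(per_class_recall.values()) / len(per_class_recall)

print(accuracy, per_class_recall, macro_recall)
```

The baseline scores 54% accuracy while finding none of the two minority classes. Resampling pushes a model toward the minority classes, which tends to help per-class recall but can cost accuracy on the majority class.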

## Logistic Regression

In [9]:
from sklearn.linear_model import LogisticRegression

In [10]:
# C is the inverse regularization strength, so C=1e12 makes the penalty negligible
logreg = LogisticRegression(fit_intercept=False, C=1e12, solver='liblinear')
logreg.fit(X_train, y_train);

In [11]:
metrics = get_metrics(y_test, X_test, logreg)
metrics['name'] = 'Logistic Regression'
models.append(metrics)

## K Nearest Neighbors

In [12]:
from sklearn.neighbors import KNeighborsClassifier

In [13]:
knn = KNeighborsClassifier()
knn.fit(X_train, y_train);

In [14]:
metrics = get_metrics(y_test, X_test, knn)
metrics['name'] = 'K Nearest Neighbors'
models.append(metrics)

## Naive Bayes

In [15]:
from sklearn.naive_bayes import GaussianNB

In [16]:
bayes = GaussianNB()
bayes.fit(X_train, y_train);

In [17]:
metrics = get_metrics(y_test, X_test, bayes)
metrics['name'] = 'Naive Bayes'
models.append(metrics)

## Decision Tree

In [18]:
from sklearn.tree import DecisionTreeClassifier

In [19]:
tree = DecisionTreeClassifier(random_state=12)
tree.fit(X_train, y_train);

In [20]:
metrics = get_metrics(y_test, X_test, tree)
metrics['name'] = 'Decision Tree'
models.append(metrics)

## Bagged Trees

In [21]:
from sklearn.ensemble import BaggingClassifier

In [22]:
bag = BaggingClassifier(DecisionTreeClassifier(random_state=12), random_state=12)
bag.fit(X_train, y_train);

In [23]:
metrics = get_metrics(y_test, X_test, bag)
metrics['name'] = 'Bagged Trees'
models.append(metrics)

## Random Forest

In [24]:
from sklearn.ensemble import RandomForestClassifier

In [25]:
forest = RandomForestClassifier(random_state=12)
forest.fit(X_train, y_train);

In [26]:
metrics = get_metrics(y_test, X_test, forest)
metrics['name'] = 'Random Forest'
models.append(metrics)

## AdaBoost

In [27]:
from sklearn.ensemble import AdaBoostClassifier

In [28]:
adaboost = AdaBoostClassifier(random_state=12)
adaboost.fit(X_train, y_train);

In [29]:
metrics = get_metrics(y_test, X_test, adaboost)
metrics['name'] = 'AdaBoost'
models.append(metrics)

## Gradient Boosting

In [30]:
from sklearn.ensemble import GradientBoostingClassifier

In [31]:
grad_boost = GradientBoostingClassifier(random_state=12)
grad_boost.fit(X_train, y_train);

In [32]:
metrics = get_metrics(y_test, X_test, grad_boost)
metrics['name'] = 'Gradient Boosting'
models.append(metrics)

## XGBoost

In [33]:
from xgboost import XGBClassifier

In [34]:
xgb = XGBClassifier(random_state=12)
xgb.fit(X_train, y_train);

In [35]:
metrics = get_metrics(y_test, X_test, xgb)
metrics['name'] = 'XG Boost'
models.append(metrics)

## Support Vector Machines

In [36]:
from sklearn.svm import SVC

In [37]:
svc = SVC(random_state=12)
svc.fit(X_train, y_train);

In [38]:
metrics = get_metrics(y_test, X_test, svc)
metrics['name'] = 'Support Vector Machine'
models.append(metrics)
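
The fit, score, and append pattern repeated in the sections above could also be written as a single loop over a dictionary of estimators. The sketch below shows that idea on synthetic data from `make_classification`; the data, the `candidates` dictionary, and the accuracy-only scoring are assumptions for illustration and stand in for the notebook's real data and its `get_metrics` helper.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class data standing in for the real features
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_classes=3, random_state=12
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=212)

# Hypothetical subset of the baseline models
candidates = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "K Nearest Neighbors": KNeighborsClassifier(),
    "Decision Tree": DecisionTreeClassifier(random_state=12),
}

results = []
for name, model in candidates.items():
    model.fit(X_tr, y_tr)
    results.append({"name": name, "accuracy": accuracy_score(y_te, model.predict(X_te))})

# Rank by accuracy, best first
ranked = sorted(results, key=lambda r: r["accuracy"], reverse=True)
for row in ranked:
    print(row)
```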

## Analysis

In [39]:
models_df = pd.DataFrame(models)
models_df.sort_values(by='accuracy', ascending=False)

,accuracy,f1,precision,recall,name
5,0.792997,0.786769,0.784668,0.792997,Random Forest
4,0.783771,0.777718,0.776013,0.783771,Bagged Trees
1,0.779798,0.771395,0.770994,0.779798,K Nearest Neighbors
9,0.77037,0.749068,0.765879,0.77037,Support Vector Machine
7,0.749899,0.724101,0.749291,0.749899,Gradient Boosting
3,0.746397,0.745234,0.744187,0.746397,Decision Tree
8,0.744175,0.71614,0.746941,0.744175,XG Boost
0,0.734007,0.704561,0.725226,0.734007,Logistic Regression
6,0.727407,0.700264,0.719085,0.727407,AdaBoost
2,0.54229,0.589444,0.694734,0.54229,Naive Bayes


## Conclusions

I will go into more depth on some of the best-performing models, tuning their hyperparameters with GridSearchCV to find the best one. I will look further into K Nearest Neighbors, Random Forest, and SVM. I will also try XGBoost: it didn't perform well here, but it is more sensitive to hyperparameter tuning, so I expect its performance to improve more than the others. Bagging improved considerably on the single decision tree (0.784 vs. 0.746 accuracy), so I will try bagging these already successful models as well.
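
As a minimal sketch of the tuning step described above, the block below runs `GridSearchCV` over a small random forest grid with accuracy as the scoring metric. The synthetic data and the particular grid values are assumptions chosen to keep the example fast; they are not the settings used in this project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic three-class data standing in for the real training set
X_demo, y_demo = make_classification(
    n_samples=200, n_features=8, n_informative=4, n_classes=3, random_state=12
)

# Hypothetical, deliberately tiny grid: 2 x 2 = 4 candidate settings
param_grid = {"n_estimators": [25, 50], "max_depth": [None, 5]}

search = GridSearchCV(
    RandomForestClassifier(random_state=12),
    param_grid,
    scoring="accuracy",  # matches the deciding metric of this notebook
    cv=3,
)
search.fit(X_demo, y_demo)

best_params = search.best_params_
best_score = search.best_score_  # mean cross-validated accuracy of the best setting
print(best_params, best_score)
```

Because the grid is scored by cross-validation on the training data, the held-out test set can still be kept for a final comparison.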