#  Predict which water pumps are faulty

**The goal is to identify, with above 60% accuracy, which water wells are faulty or non-functional.** I will be using data from Taarifa and the Tanzanian Ministry of Water. My predictions will be submitted as a .csv file with the columns 'id' and 'status_group'. Let's start by loading the data and getting a feel for it.

In [1]:
import pandas as pd
import zipfile
zf_labl = zipfile.ZipFile('C:/Users/dakot/Downloads/train_labels.csv.zip')
zf_content = zipfile.ZipFile('C:/Users/dakot/Downloads/train_features.csv.zip')
# each archive holds a single CSV, so read its first member
df_label = pd.read_csv(zf_labl.open(zf_labl.namelist()[0]))
df_feats = pd.read_csv(zf_content.open(zf_content.namelist()[0]))

In [2]:
df_label.describe(include='object')

Unnamed: 0,status_group
count,59400
unique,3
top,functional
freq,32259


**The df_label data frame contains the id along with the status of each well. The status will be our dependent variable for this project. I plan to keep the id alongside the other features to ensure traceability throughout.**

In [3]:
#let us see the whole column profile of the data frame 
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
df_feats.head()

Unnamed: 0,id,amount_tsh,date_recorded,funder,gps_height,installer,longitude,latitude,wpt_name,num_private,basin,subvillage,region,region_code,district_code,lga,ward,population,public_meeting,recorded_by,scheme_management,scheme_name,permit,construction_year,extraction_type,extraction_type_group,extraction_type_class,management,management_group,payment,payment_type,water_quality,quality_group,quantity,quantity_group,source,source_type,source_class,waterpoint_type,waterpoint_type_group
0,69572,6000.0,2011-03-14,Roman,1390,Roman,34.938093,-9.856322,none,0,Lake Nyasa,Mnyusi B,Iringa,11,5,Ludewa,Mundindi,109,True,GeoData Consultants Ltd,VWC,Roman,False,1999,gravity,gravity,gravity,vwc,user-group,pay annually,annually,soft,good,enough,enough,spring,spring,groundwater,communal standpipe,communal standpipe
1,8776,0.0,2013-03-06,Grumeti,1399,GRUMETI,34.698766,-2.147466,Zahanati,0,Lake Victoria,Nyamara,Mara,20,2,Serengeti,Natta,280,,GeoData Consultants Ltd,Other,,True,2010,gravity,gravity,gravity,wug,user-group,never pay,never pay,soft,good,insufficient,insufficient,rainwater harvesting,rainwater harvesting,surface,communal standpipe,communal standpipe
2,34310,25.0,2013-02-25,Lottery Club,686,World vision,37.460664,-3.821329,Kwa Mahundi,0,Pangani,Majengo,Manyara,21,4,Simanjiro,Ngorika,250,True,GeoData Consultants Ltd,VWC,Nyumba ya mungu pipe scheme,True,2009,gravity,gravity,gravity,vwc,user-group,pay per bucket,per bucket,soft,good,enough,enough,dam,dam,surface,communal standpipe multiple,communal standpipe
3,67743,0.0,2013-01-28,Unicef,263,UNICEF,38.486161,-11.155298,Zahanati Ya Nanyumbu,0,Ruvuma / Southern Coast,Mahakamani,Mtwara,90,63,Nanyumbu,Nanyumbu,58,True,GeoData Consultants Ltd,VWC,,True,1986,submersible,submersible,submersible,vwc,user-group,never pay,never pay,soft,good,dry,dry,machine dbh,borehole,groundwater,communal standpipe multiple,communal standpipe
4,19728,0.0,2011-07-13,Action In A,0,Artisan,31.130847,-1.825359,Shuleni,0,Lake Victoria,Kyanyamisa,Kagera,18,1,Karagwe,Nyakasimbi,0,True,GeoData Consultants Ltd,,,True,0,gravity,gravity,gravity,other,other,never pay,never pay,soft,good,seasonal,seasonal,rainwater harvesting,rainwater harvesting,surface,communal standpipe,communal standpipe


In [4]:
df_feats.shape, df_label.shape

((59400, 40), (59400, 2))

**The df_feats data frame contains all the features we will use to predict the status of any given well.** For the first iteration I will run a simple baseline. The features data frame is 59400 by 40 and the label data frame is 59400 by 2. Let's check the distribution of well statuses.

In [5]:
df_label.status_group.value_counts(normalize = True)

functional                 0.543081
non functional             0.384242
functional needs repair    0.072677
Name: status_group, dtype: float64

**Overall, the majority of wells are functional (54.31%), non-functional wells are the second largest group (38.42%), and functional wells needing repair round out the data set (7.27%).** If I blindly predicted that every well was functional, I would be correct around 54 percent of the time. Not bad, but not nearly conclusive or useful in the real world. Let's dig deeper.
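To make the blind-guess baseline concrete, here is a minimal sketch. The class counts are reconstructed from the proportions printed above and the 59,400 total, so treat them as illustrative rather than as a separate measurement.

```python
import pandas as pd

# Class counts reconstructed from the proportions shown above
# (59,400 wells in total); illustrative, not re-read from the data.
counts = pd.Series(
    {"functional": 32259, "non functional": 22824, "functional needs repair": 4317}
)

# Always predicting the most frequent class is correct exactly as often
# as that class occurs in the data.
majority_class = counts.idxmax()
baseline_accuracy = counts.max() / counts.sum()
print(majority_class, round(baseline_accuracy, 4))
```

Any model worth keeping should beat this number on held-out data.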

For simplicity I will combine the two data frames to process before splitting. 


In [6]:
full = df_label.merge(df_feats, on='id')

In [7]:
full.head()

Unnamed: 0,id,status_group,amount_tsh,date_recorded,funder,gps_height,installer,longitude,latitude,wpt_name,num_private,basin,subvillage,region,region_code,district_code,lga,ward,population,public_meeting,recorded_by,scheme_management,scheme_name,permit,construction_year,extraction_type,extraction_type_group,extraction_type_class,management,management_group,payment,payment_type,water_quality,quality_group,quantity,quantity_group,source,source_type,source_class,waterpoint_type,waterpoint_type_group
0,69572,functional,6000.0,2011-03-14,Roman,1390,Roman,34.938093,-9.856322,none,0,Lake Nyasa,Mnyusi B,Iringa,11,5,Ludewa,Mundindi,109,True,GeoData Consultants Ltd,VWC,Roman,False,1999,gravity,gravity,gravity,vwc,user-group,pay annually,annually,soft,good,enough,enough,spring,spring,groundwater,communal standpipe,communal standpipe
1,8776,functional,0.0,2013-03-06,Grumeti,1399,GRUMETI,34.698766,-2.147466,Zahanati,0,Lake Victoria,Nyamara,Mara,20,2,Serengeti,Natta,280,,GeoData Consultants Ltd,Other,,True,2010,gravity,gravity,gravity,wug,user-group,never pay,never pay,soft,good,insufficient,insufficient,rainwater harvesting,rainwater harvesting,surface,communal standpipe,communal standpipe
2,34310,functional,25.0,2013-02-25,Lottery Club,686,World vision,37.460664,-3.821329,Kwa Mahundi,0,Pangani,Majengo,Manyara,21,4,Simanjiro,Ngorika,250,True,GeoData Consultants Ltd,VWC,Nyumba ya mungu pipe scheme,True,2009,gravity,gravity,gravity,vwc,user-group,pay per bucket,per bucket,soft,good,enough,enough,dam,dam,surface,communal standpipe multiple,communal standpipe
3,67743,non functional,0.0,2013-01-28,Unicef,263,UNICEF,38.486161,-11.155298,Zahanati Ya Nanyumbu,0,Ruvuma / Southern Coast,Mahakamani,Mtwara,90,63,Nanyumbu,Nanyumbu,58,True,GeoData Consultants Ltd,VWC,,True,1986,submersible,submersible,submersible,vwc,user-group,never pay,never pay,soft,good,dry,dry,machine dbh,borehole,groundwater,communal standpipe multiple,communal standpipe
4,19728,functional,0.0,2011-07-13,Action In A,0,Artisan,31.130847,-1.825359,Shuleni,0,Lake Victoria,Kyanyamisa,Kagera,18,1,Karagwe,Nyakasimbi,0,True,GeoData Consultants Ltd,,,True,0,gravity,gravity,gravity,other,other,never pay,never pay,soft,good,seasonal,seasonal,rainwater harvesting,rainwater harvesting,surface,communal standpipe,communal standpipe


In [8]:
full.isnull().sum()

id                           0
status_group                 0
amount_tsh                   0
date_recorded                0
funder                    3635
gps_height                   0
installer                 3655
longitude                    0
latitude                     0
wpt_name                     0
num_private                  0
basin                        0
subvillage                 371
region                       0
region_code                  0
district_code                0
lga                          0
ward                         0
population                   0
public_meeting            3334
recorded_by                  0
scheme_management         3877
scheme_name              28166
permit                    3056
construction_year            0
extraction_type              0
extraction_type_group        0
extraction_type_class        0
management                   0
management_group             0
payment                      0
payment_type                 0
water_qu

**After merging the data frames I find some null values that may skew our results or otherwise break our functions along the way.** For now we will drop every column that contains missing values; later we may impute some values to help our models predict better if needed.
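The direction of the drop matters: `dropna(axis=1)` removes whole columns, while `dropna(axis=0)` removes rows. The sketch below shows both on a made-up toy frame, plus a simple placeholder fill as one possible imputation. The toy values and the `"unknown"` placeholder are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the merged data; the values are made up.
toy = pd.DataFrame(
    {
        "id": [1, 2, 3, 4],
        "funder": ["Roman", np.nan, "Unicef", "Grumeti"],
        "installer": ["Roman", "GRUMETI", np.nan, "Artisan"],
        "gps_height": [1390, 1399, 686, 263],
    }
)

drop_columns = toy.dropna(axis=1)  # keeps all 4 rows, loses 'funder' and 'installer'
drop_rows = toy.dropna(axis=0)     # keeps all 4 columns, loses the 2 incomplete rows

# Alternative: keep everything and mark the gaps with an explicit placeholder.
filled = toy.fillna({"funder": "unknown", "installer": "unknown"})

print(drop_columns.shape, drop_rows.shape, filled.isna().sum().sum())
```

Dropping columns keeps every well in the training set at the cost of losing features; filling keeps both, but the placeholder becomes a category the model has to interpret.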

In [9]:
clean = full.dropna(axis = 1)

In [10]:
clean.isna().sum()

id                       0
status_group             0
amount_tsh               0
date_recorded            0
gps_height               0
longitude                0
latitude                 0
wpt_name                 0
num_private              0
basin                    0
region                   0
region_code              0
district_code            0
lga                      0
ward                     0
population               0
recorded_by              0
construction_year        0
extraction_type          0
extraction_type_group    0
extraction_type_class    0
management               0
management_group         0
payment                  0
payment_type             0
water_quality            0
quality_group            0
quantity                 0
quantity_group           0
source                   0
source_type              0
source_class             0
waterpoint_type          0
waterpoint_type_group    0
dtype: int64

**Now that we have no NaN values, let's make train and test sets from our data. We want to predict status, so we will call that the 'y' variable. All other features will make up our 'X' matrix of features.**

In [11]:
from sklearn.model_selection import train_test_split
X1 = clean.drop(columns=['status_group'])
y = clean['status_group']
X_train, X_test, y_train, y_test = train_test_split(X1, y, test_size=.5, random_state=42)

In [12]:
X_train.head()

Unnamed: 0,id,amount_tsh,date_recorded,gps_height,longitude,latitude,wpt_name,num_private,basin,region,region_code,district_code,lga,ward,population,recorded_by,construction_year,extraction_type,extraction_type_group,extraction_type_class,management,management_group,payment,payment_type,water_quality,quality_group,quantity,quantity_group,source,source_type,source_class,waterpoint_type,waterpoint_type_group
142,64130,0.0,2012-10-23,0,32.785025,-5.418031,Kwa Ramadhani,0,Lake Tanganyika,Tabora,14,5,Sikonge,Igigwa,0,GeoData Consultants Ltd,0,india mark ii,india mark ii,handpump,vwc,user-group,never pay,never pay,soft,good,dry,dry,shallow well,shallow well,groundwater,hand pump,hand pump
1056,5968,0.0,2011-06-04,1804,34.767711,-9.089774,Kwa Deo Ngimbusi,0,Rufiji,Iringa,11,4,Njombe,Mdandu,65,GeoData Consultants Ltd,2009,gravity,gravity,gravity,vwc,user-group,pay when scheme fails,on failure,soft,good,enough,enough,spring,spring,groundwater,communal standpipe,communal standpipe
54991,53989,0.0,2012-10-08,0,34.53164,-3.727918,Mwabalomolo,0,Internal,Shinyanga,17,6,Meatu,Mwanjoro,0,GeoData Consultants Ltd,0,nira/tanira,nira/tanira,handpump,wug,user-group,never pay,never pay,soft,good,enough,enough,shallow well,shallow well,groundwater,hand pump,hand pump
23651,3849,0.0,2013-01-07,0,32.800493,-5.018881,Kwa Mkonde,0,Lake Tanganyika,Tabora,14,6,Tabora Urban,Chemchem,0,GeoData Consultants Ltd,0,other,other,other,other,other,pay per bucket,per bucket,soft,good,enough,enough,dam,dam,surface,communal standpipe multiple,communal standpipe
36341,67824,0.0,2013-02-13,486,34.77395,-11.231885,Kwa Mzee Kanyali,0,Lake Nyasa,Ruvuma,10,3,Mbinga,Mbamba bay,60,GeoData Consultants Ltd,2008,gravity,gravity,gravity,vwc,user-group,never pay,never pay,soft,good,dry,dry,spring,spring,groundwater,communal standpipe,communal standpipe


In [13]:
X_train.isna().sum().sum()

0

In [14]:
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
import category_encoders as ce
import numpy as np 
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder
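def dummyEncode(df):
    # Label-encode every categorical/object column of df in place.
    # Note: the encoder is refit on each column of each frame passed in,
    # so the integer codes are only consistent within a single call.
    columnsToEncode = list(df.select_dtypes(include=['category','object']))
    le = LabelEncoder()
    for feature in columnsToEncode:
        try:
            df[feature] = le.fit_transform(df[feature])
        except Exception:
            print('Error encoding '+feature)
    return df
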

In [15]:
X_train_DC = dummyEncode(X_train)
X_train_DC.head()
X_test_DC = dummyEncode(X_test)
X_test_DC.head()
X = dummyEncode(X1)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  del sys.path[0]
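The warning above is raised because `dummyEncode` writes into frames that pandas regards as slices of another frame. A separate concern is that refitting the encoder on each frame can give the same category different integer codes in train and test. The sketch below is one possible alternative, not the approach used in this notebook: it fits scikit-learn's `OrdinalEncoder` on the training data only and reuses that mapping. The toy frames are made up, and the `handle_unknown="use_encoded_value"` option assumes scikit-learn 0.24 or newer.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy train/test frames; the values are made up for illustration.
train = pd.DataFrame({"basin": ["Pangani", "Rufiji", "Internal"], "population": [250, 65, 0]})
test = pd.DataFrame({"basin": ["Rufiji", "Lake Nyasa"], "population": [60, 109]})

cat_cols = ["basin"]  # columns to encode, listed explicitly here

# Learn the category -> integer mapping from the training data only;
# categories that appear only in the test data are encoded as -1.
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
encoder.fit(train[cat_cols])

# Work on explicit copies so the original frames stay untouched,
# which also sidesteps the SettingWithCopyWarning.
train_enc = train.copy()
test_enc = test.copy()
train_enc[cat_cols] = encoder.transform(train[cat_cols])
test_enc[cat_cols] = encoder.transform(test[cat_cols])

print(test_enc)
```

With a shared mapping, "Rufiji" gets the same code in both frames, so a model trained on one can be applied to the other.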


In [16]:
X_train_DC.isna().sum().sum()
X_train_DC.shape

(29700, 33)

In [17]:
model= LogisticRegression()
model.fit(X_train_DC, y_train)
y_pred = model.predict(X_test_DC)
accuracy_score(y_test, y_pred)




0.6278451178451179

In [18]:
pipeline = make_pipeline(ce.OneHotEncoder(use_cat_names=True),
                         StandardScaler(), LogisticRegression(solver ='lbfgs',n_jobs=-1, multi_class = 'auto',C=2))
pipeline.fit(X_train_DC, y_train)

  return self.partial_fit(X, y)
  return self.fit(X, y, **fit_params).transform(X)


Pipeline(memory=None,
     steps=[('onehotencoder', OneHotEncoder(cols=[], drop_invariant=False, handle_unknown='impute',
       impute_missing=True, return_df=True, use_cat_names=True, verbose=0)), ('standardscaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('logisticregression', LogisticRegression(C=2, cla...enalty='l2', random_state=None, solver='lbfgs', tol=0.0001,
          verbose=0, warm_start=False))])

In [19]:
y_pred = pipeline.predict(X_train)


  Xt = transform.transform(Xt)


In [20]:
pred = pd.DataFrame(y_pred, X_train_DC['id'])

In [21]:
pred.columns = ['status_group']

In [22]:
pred.head()

id,status_group
64130,non functional
5968,functional
53989,functional
3849,non functional
67824,non functional


In [23]:
newsub = pd.DataFrame(pred)
newsub.shape
sub_2 = newsub.index
subm = pd.DataFrame( newsub['status_group'],sub_2)
subm.head()
subm.reset_index(inplace = True)

In [24]:
#subm.to_csv('C:/Users/dakot/Documents/GitHub/sumbission1.csv',columns = ['id','status_group'], index = False )

In [25]:
subm.shape

(29700, 2)

Now to make it work for the actual test set. 


In [26]:
zf_test  = zipfile.ZipFile('C:/Users/dakot/Downloads/test_features.csv.zip')
df_test = pd.read_csv(zf_test.open(zipfile.ZipFile.namelist(zf_test)[0])) 

In [27]:
df_test.head()

Unnamed: 0,id,amount_tsh,date_recorded,funder,gps_height,installer,longitude,latitude,wpt_name,num_private,basin,subvillage,region,region_code,district_code,lga,ward,population,public_meeting,recorded_by,scheme_management,scheme_name,permit,construction_year,extraction_type,extraction_type_group,extraction_type_class,management,management_group,payment,payment_type,water_quality,quality_group,quantity,quantity_group,source,source_type,source_class,waterpoint_type,waterpoint_type_group
0,50785,0.0,2013-02-04,Dmdd,1996,DMDD,35.290799,-4.059696,Dinamu Secondary School,0,Internal,Magoma,Manyara,21,3,Mbulu,Bashay,321,True,GeoData Consultants Ltd,Parastatal,,True,2012,other,other,other,parastatal,parastatal,never pay,never pay,soft,good,seasonal,seasonal,rainwater harvesting,rainwater harvesting,surface,other,other
1,51630,0.0,2013-02-04,Government Of Tanzania,1569,DWE,36.656709,-3.309214,Kimnyak,0,Pangani,Kimnyak,Arusha,2,2,Arusha Rural,Kimnyaki,300,True,GeoData Consultants Ltd,VWC,TPRI pipe line,True,2000,gravity,gravity,gravity,vwc,user-group,never pay,never pay,soft,good,insufficient,insufficient,spring,spring,groundwater,communal standpipe,communal standpipe
2,17168,0.0,2013-02-01,,1567,,34.767863,-5.004344,Puma Secondary,0,Internal,Msatu,Singida,13,2,Singida Rural,Puma,500,True,GeoData Consultants Ltd,VWC,P,,2010,other,other,other,vwc,user-group,never pay,never pay,soft,good,insufficient,insufficient,rainwater harvesting,rainwater harvesting,surface,other,other
3,45559,0.0,2013-01-22,Finn Water,267,FINN WATER,38.058046,-9.418672,Kwa Mzee Pange,0,Ruvuma / Southern Coast,Kipindimbi,Lindi,80,43,Liwale,Mkutano,250,,GeoData Consultants Ltd,VWC,,True,1987,other,other,other,vwc,user-group,unknown,unknown,soft,good,dry,dry,shallow well,shallow well,groundwater,other,other
4,49871,500.0,2013-03-27,Bruder,1260,BRUDER,35.006123,-10.950412,Kwa Mzee Turuka,0,Ruvuma / Southern Coast,Losonga,Ruvuma,10,3,Mbinga,Mbinga Urban,60,,GeoData Consultants Ltd,Water Board,BRUDER,True,2000,gravity,gravity,gravity,water board,user-group,pay monthly,monthly,soft,good,enough,enough,spring,spring,groundwater,communal standpipe,communal standpipe


In [28]:
df_test.isna().sum()
nona = df_test.dropna(axis = 1)

In [29]:
nona.shape

(14358, 33)

In [30]:
X = dummyEncode(nona)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  del sys.path[0]


In [31]:
X.head()

Unnamed: 0,id,amount_tsh,date_recorded,gps_height,longitude,latitude,wpt_name,num_private,basin,region,region_code,district_code,lga,ward,population,recorded_by,construction_year,extraction_type,extraction_type_group,extraction_type_class,management,management_group,payment,payment_type,water_quality,quality_group,quantity,quantity_group,source,source_type,source_class,waterpoint_type,waterpoint_type_group
0,50785,0.0,255,1996,35.290799,-4.059696,633,0,0,8,21,3,62,16,321,0,2012,9,6,3,3,2,0,2,6,2,3,3,5,3,1,6,5
1,51630,0.0,255,1569,36.656709,-3.309214,1727,0,5,0,2,2,0,642,300,0,2000,3,1,0,7,4,0,2,6,2,2,2,8,6,0,1,1
2,17168,0.0,252,1567,34.767863,-5.004344,9483,0,0,18,13,2,108,1659,500,0,2010,9,6,3,7,4,0,2,6,2,2,2,5,3,1,6,5
3,45559,0.0,242,267,38.058046,-9.418672,5467,0,7,7,80,43,48,1178,250,0,1987,9,6,3,7,4,6,6,6,2,0,0,7,5,0,6,5
4,49871,500.0,306,1260,35.006123,-10.950412,5573,0,7,16,10,3,60,1061,60,0,2000,3,1,0,9,4,3,1,6,2,1,1,8,6,0,1,1


In [32]:
pipeline.fit(X_test, y_test)
y_preds = pipeline.predict(X)



  return self.partial_fit(X, y)
  return self.fit(X, y, **fit_params).transform(X)
  Xt = transform.transform(Xt)


In [33]:
y_preds.shape

(14358,)

In [34]:
preds = pd.DataFrame(y_preds, X['id'])
preds.columns = ['status_group']
preds.head()

id,status_group
50785,non functional
51630,functional
17168,non functional
45559,non functional
49871,functional


In [35]:
newsubs = pd.DataFrame(preds)
newsubs.shape
sub_2s = newsubs.index
subms = pd.DataFrame( newsubs['status_group'],sub_2s)
subms.head()
subms.reset_index(inplace = True)

In [36]:
subms.head()

Unnamed: 0,id,status_group
0,50785,non functional
1,51630,functional
2,17168,non functional
3,45559,non functional
4,49871,functional


In [37]:
#subms.to_csv('C:/Users/dakot/Documents/GitHub/sumbission1.csv',columns = ['id','status_group'], index = False )


The above got me a baseline score of 0.53754 on Kaggle. We can do better than that.

In [38]:
from sklearn import tree
from sklearn.metrics import classification_report
 
clf = tree.DecisionTreeClassifier(random_state=42)
clf = clf.fit(X_train, y_train)
 
y_pred2 = clf.predict(X)
#print(classification_report(y_test, y_pred2))
#print('\nAccuracy: {0:.4f}'.format(accuracy_score(y_test, y_pred2)))

In [39]:
y_pred2.shape

(14358,)

Let's try to automate the formatting for submission.

In [40]:
def format(predictions):
    pre = pd.DataFrame(predictions, X['id'])
    pre.columns = ['status_group']
    new = pd.DataFrame(pre)
    sub_2s = new.index
    subs = pd.DataFrame( new['status_group'],sub_2s)
    subs.reset_index(inplace = True)
    print(subs.head(),subs.shape)
    subs.to_csv('C:/Users/dakot/Documents/GitHub/sumbission1.csv',columns = ['id','status_group'], index = False )
    return 'YAY!'
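For comparison, here is a more compact sketch of the same formatting step. The name `make_submission` and its `path` parameter are hypothetical, introduced only for this illustration. It takes the ids explicitly instead of reading them from a global frame, and it writes a file only when a path is supplied.

```python
import pandas as pd


def make_submission(ids, predictions, path=None):
    """Build the two-column submission frame; write a CSV only if path is given."""
    submission = pd.DataFrame(
        {"id": list(ids), "status_group": list(predictions)}
    )
    if path is not None:
        submission.to_csv(path, index=False)
    return submission


# Made-up ids and labels, just to show the resulting layout.
demo = make_submission([50785, 51630], ["non functional", "functional"])
print(demo)
```

Returning the frame rather than a status string makes it easy to inspect the result before uploading it.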


In [41]:
#format(y_pred2)

# Decision Tree Classifier leads!
Kaggle score for the tree without a pipeline = 0.71054
Now let's pipeline this baby!

In [43]:
pipeline = make_pipeline(ce.OneHotEncoder(use_cat_names=True),
                         StandardScaler(), LogisticRegression(solver ='lbfgs',n_jobs=-1, multi_class = 'auto',C=2))
pipeline.fit(X_train, y_train)

  return self.partial_fit(X, y)
  return self.fit(X, y, **fit_params).transform(X)


Pipeline(memory=None,
     steps=[('onehotencoder', OneHotEncoder(cols=[], drop_invariant=False, handle_unknown='impute',
       impute_missing=True, return_df=True, use_cat_names=True, verbose=0)), ('standardscaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('logisticregression', LogisticRegression(C=2, cla...enalty='l2', random_state=None, solver='lbfgs', tol=0.0001,
          verbose=0, warm_start=False))])

In [44]:
pred3 = pipeline.predict(X)

  Xt = transform.transform(Xt)


In [45]:
pred3

array(['non functional', 'functional', 'non functional', ...,
       'non functional', 'functional', 'non functional'], dtype=object)

In [None]:
#format(pred3)

Standard-scaled, one-hot-encoded log_reg score = 0.63769 (noted for traceability).

## OK, for real this time: decision tree in a pipeline


In [57]:
treepipe = make_pipeline(ce.OneHotEncoder(use_cat_names=True),
                         StandardScaler(),tree.DecisionTreeClassifier(random_state=42) )
treepipe.fit(X_train, y_train)

  return self.partial_fit(X, y)
  return self.fit(X, y, **fit_params).transform(X)


Pipeline(memory=None,
     steps=[('onehotencoder', OneHotEncoder(cols=[], drop_invariant=False, handle_unknown='impute',
       impute_missing=True, return_df=True, use_cat_names=True, verbose=0)), ('standardscaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('decisiontreeclassifier', DecisionTreeClassifier(...        min_weight_fraction_leaf=0.0, presort=False, random_state=42,
            splitter='best'))])

In [58]:
tpred = treepipe.predict(X_test)
print(accuracy_score(y_test,tpred))
pred4 = treepipe.predict(X)

0.7174747474747475


  Xt = transform.transform(Xt)
  Xt = transform.transform(Xt)


In [59]:
format(pred4)

      id             status_group
0  50785           non functional
1  51630               functional
2  17168  functional needs repair
3  45559           non functional
4  49871               functional (14358, 2)


'YAY!'

score = 0.71040

In [60]:
from sklearn.preprocessing import RobustScaler
treepipe2 = make_pipeline(ce.OneHotEncoder(use_cat_names=True),
                         RobustScaler(),tree.DecisionTreeClassifier(random_state=42) )
treepipe2.fit(X_train, y_train)
pred = treepipe2.predict(X_test)

  Xt = transform.transform(Xt)


In [61]:
accuracy_score(y_test,pred)

0.7174747474747475

In [62]:
pred5 = treepipe2.predict(X)

In [63]:
pred5


array(['non functional', 'functional', 'functional needs repair', ...,
       'functional', 'functional', 'non functional'], dtype=object)

In [64]:
format(pred5)

      id             status_group
0  50785           non functional
1  51630               functional
2  17168  functional needs repair
3  45559           non functional
4  49871               functional (14358, 2)


'YAY!'

score = 0.71040
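The identical scores for the two scalers are what one would expect. A decision tree splits on per-feature thresholds, so rescaling a feature moves the thresholds without changing which samples fall on each side. The sketch below checks this on synthetic data; the data, the labelling rule, and the variable names are all made up for illustration.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler, StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic features and a simple rule-based label (illustrative only).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(400, 3))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)
X_fit, X_new = X_demo[:300], X_demo[300:]
y_fit = y_demo[:300]

# Same tree settings, different scaler in front.
std_tree = make_pipeline(StandardScaler(), DecisionTreeClassifier(random_state=42))
rob_tree = make_pipeline(RobustScaler(), DecisionTreeClassifier(random_state=42))

pred_std = std_tree.fit(X_fit, y_fit).predict(X_new)
pred_rob = rob_tree.fit(X_fit, y_fit).predict(X_new)

# Fraction of held-out samples on which the two pipelines agree.
agreement = float(np.mean(pred_std == pred_rob))
print(agreement)
```

So swapping scalers is unlikely to move a tree-based score; the scaler matters for models such as logistic regression, which are sensitive to feature magnitudes.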

# OK, so far I've dropped all columns with NaNs. Let's fix some of the columns and see if that helps

In [66]:
# the training data set 
full.isna().sum()

id                           0
status_group                 0
amount_tsh                   0
date_recorded                0
funder                    3635
gps_height                   0
installer                 3655
longitude                    0
latitude                     0
wpt_name                     0
num_private                  0
basin                        0
subvillage                 371
region                       0
region_code                  0
district_code                0
lga                          0
ward                         0
population                   0
public_meeting            3334
recorded_by                  0
scheme_management         3877
scheme_name              28166
permit                    3056
construction_year            0
extraction_type              0
extraction_type_group        0
extraction_type_class        0
management                   0
management_group             0
payment                      0
payment_type                 0
water_qu