
Lambda School Data Science

*Unit 2, Sprint 3, Module 1*

---


# Define ML problems

You will use your portfolio project dataset for all assignments this sprint.

## Assignment

Complete these tasks for your project, and document your decisions.

- [ ] Choose your target. Which column in your tabular dataset will you predict?
- [ ] Is your problem regression or classification?
- [ ] How is your target distributed?
    - Classification: How many classes? Are the classes imbalanced?
    - Regression: Is the target right-skewed? If so, you may want to log transform the target.
- [ ] Choose your evaluation metric(s).
    - Classification: Is your majority class frequency >= 50% and < 70% ? If so, you can just use accuracy if you want. Outside that range, accuracy could be misleading. What evaluation metric will you choose, in addition to or instead of accuracy?
    - Regression: Will you use mean absolute error, root mean squared error, R^2, or other regression metrics?
- [ ] Choose which observations you will use to train, validate, and test your model.
    - Are some observations outliers? Will you exclude them?
    - Will you do a random split or a time-based split?
- [ ] Begin to clean and explore your data.
- [ ] Begin to choose which features, if any, to exclude. Would some features "leak" future information?

If you haven't found a dataset yet, do that today. [Review requirements for your portfolio project](https://lambdaschool.github.io/ds/unit2) and choose your dataset.

Some students worry, ***what if my model isn't “good”?*** Then, [produce a detailed tribute to your wrongness. That is science!](https://twitter.com/nathanwpyle/status/1176860147223867393)

In [0]:
# data set from:
# https://www.kaggle.com/chenzhiliang/housing-data

In [0]:
# file is on my computer
from google.colab import files

In [188]:
uploaded = files.upload()

In [189]:
# getting my category encoder
!pip install category_encoders



In [0]:
# getting all my imports
import pandas as pd
import numpy as np
import seaborn as sns
import category_encoders as ce
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier
from sklearn.metrics import roc_auc_score
from sklearn.impute import SimpleImputer


In [0]:
# loading the CSV into my dataframe
df = pd.read_csv('Housing_Data.csv')

In [192]:
df.shape

(24564, 11)

In [193]:
df.head()

Unnamed: 0.1,Unnamed: 0,districtText,furnishingCode,value,floorArea,bedrooms,bathrooms,pricePerSqFt,hasFloorplans,latitude,longitude
0,0,Hougang / Punggol / Sengkang,FULL,458000,990,3,2,462.6262626,True,1.402608,103.911427
1,1,Buona Vista / West Coast / Clementi Ne...,FULL,775000,1022,3,2,758.3170254,False,1.306192,103.782499
2,2,Ang Mo Kio / Bishan / Thomson,UNFUR,875000,1259,3,2,694.9960286,True,1.371803482,103.8535492
3,3,Admiralty / Woodlands,UNKNOWN,280000,699,3,2,400.5722461,False,1.442670315,103.7778019
4,4,Hougang / Punggol / Sengkang,PART,410000,1184,3,2,346.2837838,True,1.393140738,103.8883531


In [0]:
pd.set_option('display.float_format', lambda x: '%.3f' % x)

In [195]:
df['value'].describe()

count     24564.000
mean     481441.236
std      156289.207
min      210000.000
25%      375000.000
50%      445000.000
75%      550000.000
max     2150000.000
Name: value, dtype: float64

In [0]:
y = df['value']

In [197]:
y.mean()

481441.23558866634
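
The mean (481,441) sits above the median (445,000) in the summary above, which hints at a right-skewed target. The following is a minimal sketch of how skew could be checked and a log transform applied. It uses a synthetic right-skewed series as a stand-in, so `prices` and its parameters are assumptions for illustration, not the housing data.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed "prices" (an assumption for illustration only)
rng = np.random.default_rng(0)
prices = pd.Series(np.exp(rng.normal(loc=13.0, scale=0.4, size=5000)))

# Positive skew means a long right tail
raw_skew = prices.skew()

# log1p compresses the right tail; the result is usually far more symmetric
log_prices = np.log1p(prices)
log_skew = log_prices.skew()

print(f"skew before: {raw_skew:.2f}, after log1p: {log_skew:.2f}")

# Predictions made on the log scale are mapped back with expm1
restored = np.expm1(log_prices)
```

If the real `value` column behaves the same way, a model could be fit on `np.log1p(y_train)` and its predictions converted back with `np.expm1` before computing error in dollars.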

In [198]:
#clean this column
df['districtText'].unique()

array([' Hougang  /  Punggol  /  Sengkang ',
       ' Buona  Vista  /  West  Coast  /  Clementi  New  Town ',
       ' Ang  Mo  Kio  /  Bishan  /  Thomson ',
       ' Admiralty  /  Woodlands ', ' Boon  Lay  /  Jurong  /  Tuas ',
       ' Pasir  Ris  /  Tampines ', ' Sembawang  /  Yishun ',
       ' Balestier  /  Toa  Payoh ', ' Alexandra  /  Commonwealth ',
       ' Tanglin  /  Holland  /  Bukit  Timah ',
       ' Dairy  Farm  /  Bukit  Panjang  /  Choa  Chu  Kang ',
       ' Seletar  /  Yio  Chu  Kang ', ' Macpherson  /  Potong  Pasir ',
       ' East  Coast  /  Marine  Parade ',
       ' Beach  Road  /  Bugis  /  Rochor ',
       ' Bedok  /  Upper  East  Coast ',
       ' Harbourfront  /  Telok  Blangah ', 'D03', 'D23', 'D21',
       ' Boat  Quay  /  Raffles  Place  /  Marina ',
       ' Farrer  Park  /  Serangoon  Rd ',
       ' Chinatown  /  Tanjong  Pagar ',
       ' Eunos  /  Geylang  /  Paya  Lebar ', 'D12', 'D22', 'D05', 'D19',
       'D25', 'D20', 'D27', 'D02', 'D18',
       '
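
The district names above carry leading and trailing spaces and doubled spaces between words. One possible cleanup, sketched on a few values copied from that output, collapses whitespace and trims the ends. The `D03`-style codes look like a second naming scheme; mapping them to names would need a lookup table that is not part of this notebook, so they are left unchanged here.

```python
import pandas as pd

# A few raw values copied from the unique() output above
raw = pd.Series([
    ' Hougang  /  Punggol  /  Sengkang ',
    ' Admiralty  /  Woodlands ',
    'D03',
])

# Collapse runs of whitespace to a single space, then trim both ends
cleaned = raw.str.replace(r'\s+', ' ', regex=True).str.strip()

print(cleaned.tolist())
# ['Hougang / Punggol / Sengkang', 'Admiralty / Woodlands', 'D03']
```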

In [199]:
# categorical encoder
df['furnishingCode'].unique()

array(['FULL', 'UNFUR', 'UNKNOWN', 'PART'], dtype=object)

In [200]:
df.columns

Index(['Unnamed: 0', 'districtText', 'furnishingCode', 'value', 'floorArea',
       'bedrooms', 'bathrooms', 'pricePerSqFt', 'hasFloorplans', 'latitude',
       'longitude'],
      dtype='object')

In [0]:
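# checking missing values

def value_checher(X):
  """Print each checked column and its isnull() counts, then return X.

  Note: isnull() only detects NaN/None, so string placeholders such as
  'UNKNOWN' are reported as not missing.
  """
  col_to_check = ['floorArea','bedrooms', 'bathrooms',
                  'pricePerSqFt', 'hasFloorplans',
                  'latitude','longitude']

  for col in col_to_check:
    print(f'{X[col]} \n is missing this {X[col].isnull().value_counts()} \n')

  return X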

In [202]:
value_checher(df)

0         990
1        1022
2        1259
3         699
4        1184
         ... 
24559    1119
24560    1593
24561    1388
24562     990
24563     990
Name: floorArea, Length: 24564, dtype: object 
 is missing this False    24564
Name: floorArea, dtype: int64 

0        3
1        3
2        3
3        3
4        3
        ..
24559    3
24560    4
24561    3
24562    3
24563    3
Name: bedrooms, Length: 24564, dtype: int64 
 is missing this False    24564
Name: bedrooms, dtype: int64 

0              2
1              2
2              2
3              2
4              2
          ...   
24559    UNKNOWN
24560    UNKNOWN
24561    UNKNOWN
24562          2
24563          2
Name: bathrooms, Length: 24564, dtype: object 
 is missing this False    24564
Name: bathrooms, dtype: int64 

0        462.6262626
1        758.3170254
2        694.9960286
3        400.5722461
4        346.2837838
            ...     
24559    446.8275246
24560    351.5379787
24561     288.184438
24562    535.353535

Unnamed: 0.1,Unnamed: 0,districtText,furnishingCode,value,floorArea,bedrooms,bathrooms,pricePerSqFt,hasFloorplans,latitude,longitude
0,0,Hougang / Punggol / Sengkang,FULL,458000,990,3,2,462.6262626,True,1.402608,103.911427
1,1,Buona Vista / West Coast / Clementi Ne...,FULL,775000,1022,3,2,758.3170254,False,1.306192,103.782499
2,2,Ang Mo Kio / Bishan / Thomson,UNFUR,875000,1259,3,2,694.9960286,True,1.371803482,103.8535492
3,3,Admiralty / Woodlands,UNKNOWN,280000,699,3,2,400.5722461,False,1.442670315,103.7778019
4,4,Hougang / Punggol / Sengkang,PART,410000,1184,3,2,346.2837838,True,1.393140738,103.8883531
...,...,...,...,...,...,...,...,...,...,...,...
24559,25093,Dairy Farm / Bukit Panjang / Choa Chu ...,UNKNOWN,500000,1119,3,UNKNOWN,446.8275246,False,1.383069072,103.7437925
24560,25094,Dairy Farm / Bukit Panjang / Choa Chu ...,UNKNOWN,560000,1593,4,UNKNOWN,351.5379787,False,1.392803442,103.7413797
24561,25095,Dairy Farm / Bukit Panjang / Choa Chu ...,UNKNOWN,400000,1388,3,UNKNOWN,288.184438,False,1.393122893,103.7442056
24562,25096,Hougang / Punggol / Sengkang,PART,530000,990,3,2,535.3535354,False,1.385332889,103.8938411
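
The check above reports no missing values, yet `bathrooms` visibly contains the string `UNKNOWN`, and `floorArea` has dtype `object`, which suggests (an assumption) that it holds non-numeric strings too. A minimal sketch of surfacing those hidden gaps follows, using a small stand-in frame with illustrative values rather than the real data.

```python
import numpy as np
import pandas as pd

# Small stand-in frame mimicking the object-typed columns (illustrative values)
sample = pd.DataFrame({
    'floorArea': ['990', '1022', 'UNKNOWN'],
    'bathrooms': ['2', 'UNKNOWN', '2'],
})

# isnull() finds nothing because 'UNKNOWN' is an ordinary string
hidden = sample.isnull().sum().sum()

# errors='coerce' turns anything non-numeric into NaN
numeric = sample.apply(pd.to_numeric, errors='coerce')
missing_counts = numeric.isnull().sum()

print(hidden)
print(missing_counts)
```

After this conversion the columns are floats, so an imputer can fill the gaps and the model sees real numbers instead of category codes.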


In [0]:
# high-cardinality categorical column (the column name is 'districtText')
cardinality = ['districtText']

In [0]:
train = df

In [0]:
target = 'value'

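# NOTE: pricePerSqFt equals value / floorArea (e.g. 458000 / 990 = 462.626...),
# so keeping it as a feature leaks the target.
features = train.columns.drop([target, 'districtText', 'hasFloorplans', 'Unnamed: 0'])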
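
The assignment asks whether any feature leaks the target. A quick check on the five rows shown by `df.head()` suggests that `pricePerSqFt` does: multiplying it by `floorArea` reproduces `value`. The sketch below copies those rows by hand; `safe_features` is a hypothetical name for a feature list with the leak removed.

```python
import numpy as np
import pandas as pd

# First five rows of the relevant columns, copied from df.head() above
rows = pd.DataFrame({
    'value': [458000, 775000, 875000, 280000, 410000],
    'floorArea': [990, 1022, 1259, 699, 1184],
    'pricePerSqFt': [462.6262626, 758.3170254, 694.9960286,
                     400.5722461, 346.2837838],
})

# If pricePerSqFt * floorArea reproduces value, the feature leaks the target
reconstructed = rows['pricePerSqFt'] * rows['floorArea']
leaks = np.allclose(reconstructed, rows['value'], rtol=1e-6)
print(leaks)  # True

# A leak-free feature list drops it alongside the other exclusions
columns = pd.Index(['Unnamed: 0', 'districtText', 'furnishingCode', 'value',
                    'floorArea', 'bedrooms', 'bathrooms', 'pricePerSqFt',
                    'hasFloorplans', 'latitude', 'longitude'])
safe_features = columns.drop(['value', 'districtText', 'hasFloorplans',
                              'Unnamed: 0', 'pricePerSqFt'])
```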

In [206]:
train

Unnamed: 0.1,Unnamed: 0,districtText,furnishingCode,value,floorArea,bedrooms,bathrooms,pricePerSqFt,hasFloorplans,latitude,longitude
0,0,Hougang / Punggol / Sengkang,FULL,458000,990,3,2,462.6262626,True,1.402608,103.911427
1,1,Buona Vista / West Coast / Clementi Ne...,FULL,775000,1022,3,2,758.3170254,False,1.306192,103.782499
2,2,Ang Mo Kio / Bishan / Thomson,UNFUR,875000,1259,3,2,694.9960286,True,1.371803482,103.8535492
3,3,Admiralty / Woodlands,UNKNOWN,280000,699,3,2,400.5722461,False,1.442670315,103.7778019
4,4,Hougang / Punggol / Sengkang,PART,410000,1184,3,2,346.2837838,True,1.393140738,103.8883531
...,...,...,...,...,...,...,...,...,...,...,...
24559,25093,Dairy Farm / Bukit Panjang / Choa Chu ...,UNKNOWN,500000,1119,3,UNKNOWN,446.8275246,False,1.383069072,103.7437925
24560,25094,Dairy Farm / Bukit Panjang / Choa Chu ...,UNKNOWN,560000,1593,4,UNKNOWN,351.5379787,False,1.392803442,103.7413797
24561,25095,Dairy Farm / Bukit Panjang / Choa Chu ...,UNKNOWN,400000,1388,3,UNKNOWN,288.184438,False,1.393122893,103.7442056
24562,25096,Hougang / Punggol / Sengkang,PART,530000,990,3,2,535.3535354,False,1.385332889,103.8938411


In [0]:
train, test = train_test_split(train,
                               train_size=0.80,
                               test_size=0.20,
                               stratify=train['furnishingCode'],
                               random_state=42)

In [208]:
train.shape, test.shape

((19651, 11), (4913, 11))

In [0]:
train, val = train_test_split(train,
                              train_size=0.80,
                              test_size=0.20,
                              stratify=train['furnishingCode'],
                              random_state=43)

In [210]:
train.shape, val.shape, test.shape 

((15720, 11), (3931, 11), (4913, 11))
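
The two calls above carve the data into roughly 64% train, 16% validation, and 20% test. As a sketch, the same two-stage split could be wrapped in one function; `three_way_split` is a hypothetical helper, and the toy frame below is illustrative data rather than the housing set.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def three_way_split(frame, stratify_col, seed=42):
    """Split into train/val/test at roughly 64% / 16% / 20% (hypothetical helper)."""
    # First hold out 20% as the test set
    train_val, test = train_test_split(
        frame, test_size=0.20, stratify=frame[stratify_col], random_state=seed)
    # Then hold out 20% of the remainder as the validation set
    train, val = train_test_split(
        train_val, test_size=0.20, stratify=train_val[stratify_col],
        random_state=seed + 1)
    return train, val, test

# Toy frame: 100 rows, four evenly sized furnishing codes (illustrative data)
toy = pd.DataFrame({
    'furnishingCode': ['FULL', 'UNFUR', 'UNKNOWN', 'PART'] * 25,
    'value': range(100),
})
train_part, val_part, test_part = three_way_split(toy, 'furnishingCode')
print(len(train_part), len(val_part), len(test_part))  # 64 16 20
```

Stratifying on `furnishingCode` keeps the category mix similar across the three sets; it does not balance the regression target itself.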

In [0]:
X_train = train[features]
y_train = train[target]
X_val = val[features]
y_val = val[target]
X_test = test[features]
y_test = test[target]

In [212]:
X_train.shape, X_val.shape, X_test.shape

((15720, 7), (3931, 7), (4913, 7))

In [0]:
pipeline = make_pipeline(
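    # ce.OneHotEncoder(use_cat_names=True),
    # NOTE: the first positional argument of ce.OrdinalEncoder is `verbose`,
    # not `cols`, so this call encodes every object-dtype column (see the
    # fitted `cols` list in the output below). Passing
    # cols=['furnishingCode'] would restrict it, once the numeric columns
    # stored as strings have been converted to numbers.
    ce.OrdinalEncoder('furnishingCode'),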
    SimpleImputer(strategy='most_frequent'),
    RandomForestRegressor(random_state=99, n_jobs=-1)
    )
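
To illustrate the idea of encoding only the categorical column while leaving numeric columns untouched, here is a sketch that swaps in scikit-learn's own `ColumnTransformer` and `OrdinalEncoder` in place of `category_encoders`. This is a substitute technique, not the pipeline used in this notebook, and the toy data assumes the numeric columns have already been converted to numbers.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder

# Toy data (illustrative): numeric columns are already numeric
X = pd.DataFrame({
    'furnishingCode': ['FULL', 'UNFUR', 'UNKNOWN', 'PART'] * 10,
    'floorArea': [700.0 + 10 * i for i in range(40)],
    'bedrooms': [2, 3, 3, 4] * 10,
})
y = 300 * X['floorArea'] + 10000 * X['bedrooms']

# Encode only the categorical column; pass the numeric columns through as-is
encode_categoricals = ColumnTransformer(
    [('furnishing', OrdinalEncoder(), ['furnishingCode'])],
    remainder='passthrough',
)

sketch_pipeline = make_pipeline(
    encode_categoricals,
    SimpleImputer(strategy='most_frequent'),
    RandomForestRegressor(n_estimators=50, random_state=99),
)
sketch_pipeline.fit(X, y)

# Inspect what the model actually receives from the encoding step
encoded = sketch_pipeline.named_steps['columntransformer'].transform(X)
```

The encoded matrix keeps `floorArea` and `bedrooms` as real quantities, so the forest can split on size thresholds instead of on arbitrary category codes.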

In [224]:
pipeline.fit(X_train, y_train)

Pipeline(memory=None,
         steps=[('ordinalencoder',
                 OrdinalEncoder(cols=['furnishingCode', 'floorArea',
                                      'bathrooms', 'pricePerSqFt', 'latitude',
                                      'longitude'],
                                drop_invariant=False, handle_missing='value',
                                handle_unknown='value',
                                mapping=[{'col': 'furnishingCode',
                                          'data_type': dtype('O'),
                                          'mapping': FULL       1
UNKNOWN    2
PART       3
UNFUR      4
NaN       -2
dtype: int64},
                                         {'col': 'floorArea',
                                          'data_typ...
                 RandomForestRegressor(bootstrap=True, ccp_alpha=0.0,
                                       criterion='mse', max_depth=None,
                                       max_features='auto', max_leaf_nodes=None,
  

In [225]:
# for a regressor, score() returns R^2 on the validation set
pipeline.score(X_val, y_val)

0.5057057857030889
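
An R^2 value is easier to interpret next to an error metric in the target's own units and a naive baseline. The sketch below compares mean absolute error for a mean-predicting baseline against a random forest. It runs on synthetic data, so the variable names and numbers are assumptions for illustration and say nothing about the housing results.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression data (illustrative only, not the housing set)
rng = np.random.default_rng(0)
X = rng.uniform(500, 2000, size=(1000, 1))           # stand-in for floorArea
y = 300 * X[:, 0] + rng.normal(0, 20000, size=1000)  # stand-in for value

X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, random_state=42)

# Baseline: always predict the training mean
baseline = DummyRegressor(strategy='mean').fit(X_tr, y_tr)
model = RandomForestRegressor(n_estimators=50, random_state=99).fit(X_tr, y_tr)

baseline_mae = mean_absolute_error(y_va, baseline.predict(X_va))
model_mae = mean_absolute_error(y_va, model.predict(X_va))
model_r2 = r2_score(y_va, model.predict(X_va))

print(f"baseline MAE: {baseline_mae:,.0f}")
print(f"model MAE:    {model_mae:,.0f}  (R^2 = {model_r2:.3f})")
```

On the real data, the same comparison could use `y_train.mean()` as the baseline prediction for every row in `X_val`.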