
# DTC with CA Housing
------------------------------
**Dr. Dave Wanik - University of Connecticut**

This is a bare-bones script to get you up and running. For a given dataset, you should be able to code relevant content in the cells below.

Literally... everything is the same until fitting the model... it's not that hard!

## Getting started
Import modules, mount Drive, read in the data, check data types and missing values. You may also do some light EDA prior to modeling.

Notice how we are using the same functions over and over again... it really is like following a recipe.


In [None]:
# import modules we need for EDA and wrangling
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# these functions are useful for splitting and normalization
from sklearn.preprocessing import  MinMaxScaler
from sklearn.model_selection import train_test_split

# import other functions we'll need for classification modeling
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# classification error metrics
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

In [None]:
# # mount your google drive
# from google.colab import drive
# drive.mount('/content/drive')

In [None]:
# read in some data
df = pd.read_csv('/content/sample_data/california_housing_train.csv')

In [None]:
# data type, shape and columns

print("This is the shape :\n", df.shape, '\n') # escape characters are fun! \n adds a newline
print("These are the column names: \n", df.columns, '\n') # helps keep things nice and clean
print("These are the data types: \n", df.dtypes)

# right away you see that every column is float64, so ALL DATA IS NUMERIC.
# if a column were an 'object', that would mean it's a string
# AKA something you can't do math on... you would look at the head to see what's going on.
# this is a good quiz question... ;)

print("\nThis is the head:") # see how I can pop that \n anywhere?
df.head()   # also note how much stuff I have pasted in ONE CELL...
            # now you are cooking with gas!

# the head confirms that every column is numeric
# so there is nothing to clean up before we move on

This is the shape :
 (17000, 9) 

These are the column names: 
 Index(['longitude', 'latitude', 'housing_median_age', 'total_rooms',
       'total_bedrooms', 'population', 'households', 'median_income',
       'median_house_value'],
      dtype='object') 

These are the data types: 
 longitude             float64
latitude              float64
housing_median_age    float64
total_rooms           float64
total_bedrooms        float64
population            float64
households            float64
median_income         float64
median_house_value    float64
dtype: object

This is the head:


| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value |
|---|---|---|---|---|---|---|---|---|---|
| 0 | -114.31 | 34.19 | 15.0 | 5612.0 | 1283.0 | 1015.0 | 472.0 | 1.4936 | 66900.0 |
| 1 | -114.47 | 34.4 | 19.0 | 7650.0 | 1901.0 | 1129.0 | 463.0 | 1.82 | 80100.0 |
| 2 | -114.56 | 33.69 | 17.0 | 720.0 | 174.0 | 333.0 | 117.0 | 1.6509 | 85700.0 |
| 3 | -114.57 | 33.64 | 14.0 | 1501.0 | 337.0 | 515.0 | 226.0 | 3.1917 | 73400.0 |
| 4 | -114.57 | 33.57 | 20.0 | 1454.0 | 326.0 | 624.0 | 262.0 | 1.925 | 65500.0 |


In [None]:
# here's another cool way to get a lot of this info AND MORE
df.info() # gives you missing values report too - this is nice complete data

# data types, shape, missing values per column.... pandas rocks...

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17000 entries, 0 to 16999
Data columns (total 9 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   longitude           17000 non-null  float64
 1   latitude            17000 non-null  float64
 2   housing_median_age  17000 non-null  float64
 3   total_rooms         17000 non-null  float64
 4   total_bedrooms      17000 non-null  float64
 5   population          17000 non-null  float64
 6   households          17000 non-null  float64
 7   median_income       17000 non-null  float64
 8   median_house_value  17000 non-null  float64
dtypes: float64(9)
memory usage: 1.2 MB


## Data splitting
Subset your data into X features and Y target variable for modeling. Convert X and Y to numpy arrays. Then use train_test_split for data splitting (80/20 is very common); don't forget random seed and shuffle.
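
Here is a minimal sketch of why the random seed matters. It uses a made-up 10-row array (`X_toy` and `y_toy` are hypothetical names, not the housing data) and simply checks that two splits with the same `random_state` pick the same test rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# toy data (hypothetical): 10 rows, 2 features, binary target
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0, 1] * 5)

# two independent splits that share the same seed
a_train, a_test, _, _ = train_test_split(X_toy, y_toy, test_size=0.2,
                                         shuffle=True, random_state=42)
b_train, b_test, _, _ = train_test_split(X_toy, y_toy, test_size=0.2,
                                         shuffle=True, random_state=42)

# 80/20 on 10 rows gives 8 train rows and 2 test rows
print(a_train.shape, a_test.shape)

# same seed, same shuffle, same rows in the test set
same_split = np.array_equal(a_test, b_test)
print("same test rows with the same seed:", same_split)
```

Without a fixed seed, the rows in each partition can change on every run, which makes your metrics hard to compare from one run to the next.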

In [None]:
# one extra step here - we are making up our own problem.
# we want to predict if a house price is greater than the median
# so i will use a numpy.where() statement to do this
df['median_house_value'] = np.where(df['median_house_value'] > df['median_house_value'].median(),
                      1, # high value houses
                      0) # low value houses

# check your work - looks good
df['median_house_value'].head()

0    0
1    0
2    0
3    0
4    0
Name: median_house_value, dtype: int64
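
A quick sanity check you might add after a median split is to count the classes. The sketch below uses six made-up prices (hypothetical values, not the real dataset) and writes the label to a separate `high_value` column, which is an assumption of this example; the notebook itself overwrites `median_house_value`.

```python
import numpy as np
import pandas as pd

# toy prices (hypothetical values, not the real dataset)
toy = pd.DataFrame({'median_house_value': [50000.0, 120000.0, 180000.0,
                                           250000.0, 400000.0, 90000.0]})

# the cutoff is the median of the prices
cutoff = toy['median_house_value'].median()

# 1 = above the median (high value), 0 = at or below it (low value)
toy['high_value'] = np.where(toy['median_house_value'] > cutoff, 1, 0)

# count how many rows landed in each class
counts = toy['high_value'].value_counts().sort_index()
print("cutoff:", cutoff)
print(counts)
```

Splitting at the median should give two classes of roughly equal size, which is why plain accuracy is a reasonable metric for this made-up problem.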

In [None]:
# the target variable is Y
# we know that this is 'median_house_value'
Y = df['median_house_value']
print(Y.shape) # a single column with 17000 rows

(17000,)


In [None]:
# everything else is X
# so just drop 'median_house_value' and you are done
X = df.drop('median_house_value', axis=1)
print(X.shape) # note that we have gone from 9 to 8 columns, this is good! 17000 rows.

(17000, 8)


In [None]:
# now, split the data in ONE LINE OF CODE
# notice how we are assigning four different variables at once
# this makes it really clean

# be careful of capital vs. lowercase X and Y, you might get an error...
# notice the 80/20 split we perform

X_train, X_test, y_train, y_test = train_test_split(X, Y,
                                                    test_size = 0.2,
                                                    shuffle = True,
                                                    random_state = 42)

In [None]:
# check your work - does the shape match what you think it should be?
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(13600, 8) (3400, 8) (13600,) (3400,)


In [None]:
# convert these all to numpy arrays
X_train = np.array(X_train)
X_test = np.array(X_test)
y_train = np.array(y_train)
y_test = np.array(y_test)

## Min/Max Scaling
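This will ensure all of your X training data is between 0 (min) and 1 (max). You will use fit_transform() on the train data first, then transform() (not fit!) on the test data, so the test data is scaled with the min and max learned from the train data. If you scale before splitting, or fit the scaler on the test data, you will have data leakage.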

Only scale the X data, not the Y data!

Do yourself a favor and just overwrite X_train and X_test when scaling, as I do below. Min/max scaling requires that all of the data are numeric; we already converted everything to numpy arrays above.


**Like this example:**
```
import numpy as np
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
```


In [None]:
# you probably have already imported MinMaxScaler at the top of your script
# you should convert to numpy array before scaling
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

In [None]:
# if you wanted to run summary stats on these to check the range,
# you would need to convert to a pandas dataframe.
tmp = pd.DataFrame(X_train)
tmp.describe() # notice how all the max values are 1, all min values are 0.

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|---|
| count | 13600.0 | 13600.0 | 13600.0 | 13600.0 | 13600.0 | 13600.0 | 13600.0 | 13600.0 |
| mean | 0.482551 | 0.327069 | 0.542284 | 0.069519 | 0.083567 | 0.040048 | 0.082229 | 0.23276 |
| std | 0.203859 | 0.226774 | 0.247018 | 0.057161 | 0.065365 | 0.032523 | 0.063121 | 0.131696 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 0.255341 | 0.147715 | 0.333333 | 0.038381 | 0.045779 | 0.022002 | 0.046045 | 0.142507 |
| 50% | 0.592065 | 0.181722 | 0.54902 | 0.055964 | 0.067349 | 0.032792 | 0.067259 | 0.209383 |
| 75% | 0.640895 | 0.549416 | 0.705882 | 0.083122 | 0.100403 | 0.048292 | 0.099326 | 0.293106 |
| max | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |


In [None]:
# if you wanted to run summary stats on these to check the range,
# you would need to convert to a pandas dataframe.
tmp = pd.DataFrame(X_test)
tmp.describe() # notice how the min and max are NOT exactly 0 and 1 here,
               # because the scaler only learned the min and max of the train data.

# sometimes it's less, sometimes it's more!

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|---|
| count | 3400.0 | 3400.0 | 3400.0 | 3400.0 | 3400.0 | 3400.0 | 3400.0 | 3400.0 |
| mean | 0.479711 | 0.331056 | 0.535704 | 0.070107 | 0.083495 | 0.039726 | 0.082383 | 0.23573 |
| std | 0.204498 | 0.228578 | 0.245906 | 0.058673 | 0.065596 | 0.03073 | 0.063689 | 0.131182 |
| min | -0.005086 | 0.002125 | 0.019608 | 0.000343 | 0.00031 | 0.000224 | 0.000164 | 0.0 |
| 25% | 0.257375 | 0.147715 | 0.333333 | 0.038856 | 0.046051 | 0.02217 | 0.046538 | 0.142965 |
| 50% | 0.587996 | 0.182784 | 0.529412 | 0.056478 | 0.066574 | 0.032022 | 0.066601 | 0.212035 |
| 75% | 0.638861 | 0.552604 | 0.705882 | 0.082318 | 0.100597 | 0.047675 | 0.09949 | 0.296555 |
| max | 1.016277 | 0.98406 | 1.0 | 0.744853 | 0.768312 | 0.341938 | 0.758921 | 1.0 |
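
To see why the scaled test data can dip below 0 or go above 1, here is a tiny worked example with one made-up feature (the values are hypothetical). The scaler learns min=10 and max=30 from the train data, then applies `(x - 10) / (30 - 10)` to everything, including test values that sit outside that range.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# toy single-feature data (hypothetical values)
train = np.array([[10.0], [20.0], [30.0]])
test = np.array([[5.0], [25.0], [40.0]])

scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train)  # learns min=10, max=30 from train only
test_scaled = scaler.transform(test)        # reuses the train min and max

print(train_scaled.ravel())  # 0.0, 0.5, 1.0
print(test_scaled.ravel())   # -0.25, 0.75, 1.5
```

A test value smaller than the train minimum comes out negative and one larger than the train maximum comes out above 1. That is expected behavior and not a sign that something went wrong.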


## Fit The Model
Fit the model and make new variables to save your train and test predictions. Make sure you are using the appropriate regression or classification model.

Notice how we're using the same training data and test data for each model - this is critical!

Also note that in many StackOverflow examples, folks don't use 'DTC' or 'LR' as variable names for their models - instead, they often use 'clf' which stands for classifier. Helps you abstract what's going on.
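
Here is a rough sketch of what that 'clf' style can look like when you have more than one model. It runs on synthetic data from `make_classification` rather than the housing data, and the `models` and `test_preds` dictionaries are names made up for this example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# synthetic stand-in data (not the housing data)
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=42)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.2,
                                      shuffle=True, random_state=42)

# one entry per model; the key doubles as the suffix you would otherwise type by hand
models = {
    'LR': LogisticRegression(max_iter=1000),
    'DTC': DecisionTreeClassifier(min_samples_split=15, random_state=42),
}

# every model sees the same train and test data
test_preds = {}
for name, clf in models.items():
    clf.fit(Xtr, ytr)
    test_preds[name] = clf.predict(Xte)
    print(name, "test accuracy:", clf.score(Xte, yte))
```

The loop body never mentions a specific model, which is the abstraction that the 'clf' name is hinting at.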

### Logistic Regression

In [None]:
# make a variable to store the general model
LR = LogisticRegression()
# fit the model - one line of code
LR = LR.fit(X_train, y_train)

In [None]:
# store the predictions
train_preds_LR = LR.predict(X_train)
test_preds_LR = LR.predict(X_test)

### DTC

Check out the model documentation:

**DTC Model Documentation:** https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html


Some extra content to think about...

**Link:** https://scikit-learn.org/stable/auto_examples/tree/plot_cost_complexity_pruning.html

**Link:** https://scikit-learn.org/stable/modules/tree.html

In [None]:
# make a variable to store the general model
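# you could accept all of the defaults with DecisionTreeClassifier()...
# here we set min_samples_split so the tree doesn't grow quite as large
DTC = DecisionTreeClassifier(min_samples_split=15)
# or 'tinker' some more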
#DTC = DecisionTreeClassifier(criterion='entropy',
#                            min_samples_split=15) # make this bigger and the tree will shrink!

# fit the model - one line of code
DTC = DTC.fit(X_train, y_train)
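
To make the "bigger min_samples_split, smaller tree" idea concrete, here is a small sketch on noisy synthetic data (not the housing data). The values 2, 15 and 200 are arbitrary choices for illustration, and `leaf_counts` is a name made up for this example.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# synthetic data with 10% label noise so an unconstrained tree grows large
X_demo, y_demo = make_classification(n_samples=1000, n_features=8,
                                     flip_y=0.1, random_state=42)

leaf_counts = {}
for mss in (2, 15, 200):
    # a node needs at least `mss` samples before the tree is allowed to split it
    tree = DecisionTreeClassifier(min_samples_split=mss, random_state=42)
    tree.fit(X_demo, y_demo)
    leaf_counts[mss] = tree.get_n_leaves()
    print(f"min_samples_split={mss}: leaves={tree.get_n_leaves()}, depth={tree.get_depth()}")
```

A smaller tree usually fits the train data less closely, which can narrow the gap between train and test performance.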

In [None]:
# store the predictions
train_preds_DTC = DTC.predict(X_train)
test_preds_DTC = DTC.predict(X_test)

In [None]:
# # show the tree
# # link: https://www.datacamp.com/community/tutorials/decision-tree-classification-python
# from sklearn.tree import export_graphviz
# from io import StringIO
# from IPython.display import Image
# import pydotplus

# dot_data = StringIO()
# export_graphviz(DTC,  # this is the name of your model!
#                 out_file=dot_data,
#                 filled=True, rounded=True,
#                 special_characters=True)
# graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
# Image(graph.create_png())

# # hard to see, but that's OK!
# # we will learn about partial dependence
# # and feature importance, which is way easier to see than this

# # for now, just appreciate how awesome this looks!
# # double click and you can see

## Evaluate the Model
Look at the appropriate error metrics depending on the problem you are solving.

For a regression problem, look at the R2, MAE and MSE; then make a scatterplot of actual vs. predicted values with nice labels and titles.

For a classification problem, create the confusion matrix and the classification report (the report gives precision, recall, F1 and accuracy in one line of code).

See how we are just tacking on a suffix like '_LR' or '_DTC' on the end of things? Keep your code clean and you can just copy paste. Same stuff as before, but we need to be organized since we are introducing more models.

## Confusion Matrix
Look for 'consistency' between the two partitions (train and test)!

### LR


In [None]:
# train confusion matrix
confusion_matrix(y_train, train_preds_LR)

array([[5670, 1182],
       [1171, 5577]])

In [None]:
# test confusion matrix
confusion_matrix(y_test, test_preds_LR)

array([[1377,  279],
       [ 319, 1425]])

### DTC

In [None]:
# train confusion matrix
confusion_matrix(y_train, train_preds_DTC)

array([[6527,  325],
       [ 375, 6373]])

In [None]:
# test confusion matrix
confusion_matrix(y_test, test_preds_DTC)

array([[1426,  230],
       [ 282, 1462]])
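
One way to put a number on 'consistency' is to turn each matrix into an accuracy and compare train vs. test. The sketch below just retypes the four confusion matrices printed above; the `accuracy` helper and the `gaps` dictionary are names made up for this example.

```python
import numpy as np

# the four confusion matrices from the cells above (rows = actual, columns = predicted)
matrices = {
    ('LR', 'train'): np.array([[5670, 1182], [1171, 5577]]),
    ('LR', 'test'): np.array([[1377, 279], [319, 1425]]),
    ('DTC', 'train'): np.array([[6527, 325], [375, 6373]]),
    ('DTC', 'test'): np.array([[1426, 230], [282, 1462]]),
}

def accuracy(cm):
    # correct predictions sit on the diagonal
    return np.trace(cm) / cm.sum()

gaps = {}
for model in ('LR', 'DTC'):
    train_acc = accuracy(matrices[(model, 'train')])
    test_acc = accuracy(matrices[(model, 'test')])
    gaps[model] = train_acc - test_acc
    print(f"{model}: train={train_acc:.3f} test={test_acc:.3f} gap={gaps[model]:.3f}")
```

The LR accuracies are nearly identical across partitions (about 0.827 vs. 0.824), while the DTC drops from about 0.949 on train to about 0.849 on test. That larger gap suggests the tree is overfitting somewhat, even though its test accuracy is still the higher of the two.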

## Classification Report

### LR

In [None]:
# train report
trainReport_LR = classification_report(y_train, train_preds_LR)
print(trainReport_LR)

              precision    recall  f1-score   support

           0       0.83      0.83      0.83      6852
           1       0.83      0.83      0.83      6748

    accuracy                           0.83     13600
   macro avg       0.83      0.83      0.83     13600
weighted avg       0.83      0.83      0.83     13600



In [None]:
# test report
testReport_LR = classification_report(y_test, test_preds_LR)
print(testReport_LR)

              precision    recall  f1-score   support

           0       0.81      0.83      0.82      1656
           1       0.84      0.82      0.83      1744

    accuracy                           0.82      3400
   macro avg       0.82      0.82      0.82      3400
weighted avg       0.82      0.82      0.82      3400



### DTC

In [None]:
# train report
trainReport_DTC = classification_report(y_train, train_preds_DTC)
print(trainReport_DTC)

              precision    recall  f1-score   support

           0       0.95      0.95      0.95      6852
           1       0.95      0.94      0.95      6748

    accuracy                           0.95     13600
   macro avg       0.95      0.95      0.95     13600
weighted avg       0.95      0.95      0.95     13600



In [None]:
# test report
testReport_DTC = classification_report(y_test, test_preds_DTC)
print(testReport_DTC)

              precision    recall  f1-score   support

           0       0.83      0.86      0.85      1656
           1       0.86      0.84      0.85      1744

    accuracy                           0.85      3400
   macro avg       0.85      0.85      0.85      3400
weighted avg       0.85      0.85      0.85      3400
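
If you are wondering where the numbers in the report come from, they can be rebuilt by hand from the confusion matrix. This sketch uses the DTC test confusion matrix printed earlier and reproduces the row for class 1; the variable names are made up for this example.

```python
import numpy as np

# DTC test confusion matrix (rows = actual, columns = predicted)
cm = np.array([[1426, 230],
               [282, 1462]])

# for a binary problem, ravel() gives the four cells in this order
tn, fp, fn, tp = cm.ravel()

precision_1 = tp / (tp + fp)  # of the houses predicted high value, how many really were
recall_1 = tp / (tp + fn)     # of the truly high value houses, how many we caught
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)
accuracy = (tp + tn) / cm.sum()

print(f"precision={precision_1:.2f} recall={recall_1:.2f} "
      f"f1={f1_1:.2f} accuracy={accuracy:.2f}")
```

These round to 0.86, 0.84, 0.85 and 0.85, which match the class 1 row and the accuracy row of the DTC test report.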



# Done!
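You have just completed a very simple ML framework for classification modeling. Even though you started with a plain logistic regression, you still got solid results (about 82% test accuracy), and the decision tree did a little better on the test data (about 85%).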

Later on, you will expand on these topics and start fitting multiple models, and may start tweaking them ('hyperparameter tuning') to get even better performance. Sit tight!